Abstract

Submodular and fractionally subadditive (or equivalently XOS) functions play a fundamental
role in combinatorial optimization, algorithmic game theory and machine learning. Motivated
by learnability of these classes of functions from random examples, we consider the question of
how well such functions can be approximated by low-degree polynomials in $\ell_2$ norm over the
uniform distribution. This question is equivalent to understanding the concentration of Fourier
weight on low-degree coefficients, a central concept in Fourier analysis. Denoting the smallest
degree sufficient to approximate $f$ in $\ell_2$ norm within $\epsilon$ by $\deg^{\ell_2}_\epsilon(f)$, we show that

• For any submodular function $f : \{0,1\}^n \to [0,1]$, $\deg^{\ell_2}_\epsilon(f) = O(\log(1/\epsilon)/\epsilon^{4/5})$ and there is
a submodular function that requires degree $\Omega(1/\epsilon^{4/5})$.

• For any XOS function $f : \{0,1\}^n \to [0,1]$, $\deg^{\ell_2}_\epsilon(f) = O(1/\epsilon)$ and there exists an XOS
function that requires degree $\Omega(1/\epsilon)$.

This improves on previous approaches that all showed an upper bound of $O(1/\epsilon^2)$ for submodular
[CKKL12, FKV13, FV13] and XOS [FV13] functions. The best previous lower bound was
$\Omega(1/\epsilon^{2/3})$ for monotone submodular functions [FKV13]. Our techniques reveal new structural
properties of submodular and XOS functions and the upper bounds lead to nearly optimal PAC
learning algorithms for these classes of functions.












1 Introduction 


Analysis of the discrete Fourier transform of functions over the hypercube has a wide range of 
notable applications in theoretical computer science. It is also the object of significant research 
interest in its own right [O'D14]. While most of this research has been devoted to Boolean-valued
functions, many works analyze general real-valued functions (e.g. [Tal94, DFKO06]). Recently, the
analysis of real-valued functions over the hypercube has also attracted significant attention due
to applications in learning theory, property testing, differential privacy, algorithmic game theory 
and quantum complexity [rTHIMOfll iBTml KIHRUIIl [Wm I(1KKL12[ lBDF+12[ IB(1TW121 iRYTm 
IFKV131 IFV131 IFK141 IBRY141 IAA141 IBB14| . Most of the Fourier-analytic techniques apply to 
real-valued functions as well but many new questions arise when one considers the richer structure 
of real-valued functions. 

Our focus is on structural properties of two fundamental classes of real-valued functions:
submodular and fractionally subadditive. Submodularity, a discrete analog of convexity, has played an
essential role in combinatorial optimization [Edm70, Lov83, Que95, Fra97, FFI01] and, more
recently, in algorithmic game theory and machine learning [GKS05, BLN06, DS06, KGGK06, KSG08,
Von08]. In algorithmic game theory, submodular functions have found application as valuation
functions with the property of diminishing returns [BLN06, DS06, Von08]. Along with submodular
functions, fractionally subadditive functions have been studied in the algorithmic game theory
context [BLN06] (see Section 2 for the definition). Feige showed that these functions have an additional
characterization as a maximum of non-negative linear functions or XOS [Fei06]. Here we also show
that the Rademacher complexity of a set of vectors, which plays a fundamental role in statistical
learning, gives yet another equivalent way to define this class of functions. For comparison, we also
discuss the class of self-bounding functions that contains both submodular and XOS functions and
shares a number of properties with those classes such as dimension-free concentration of measure
[BLM00]. Informally, a function $f : \{0,1\}^n \to \mathbb{R}$ is self-bounding if for every $x \in \{0,1\}^n$, $f(x)$
upper bounds the sum of all the $n$ marginal decreases in the value of the function at $x$. We define
these classes and their relationships in Section 2.

The primary property we consider is how well these functions can be approximated by
low-degree polynomials, where the approximation is measured in $\ell_2$ norm over the uniform distribution
$U$, defined as $\|f - g\|_2 = \sqrt{\mathbf{E}_{x \sim U}[(f(x) - g(x))^2]}$. By the standard duality for the $\ell_2$ norm,
approximability of $f$ by polynomials of degree $d$ is characterized by how much of $f$'s Fourier weight resides
on coefficients of degree above $d$. Concentration of the Fourier spectrum on low-degree coefficients
is one of the central and most well-studied properties in Fourier analysis and its applications. In
particular, following the seminal work of Linial, Mansour and Nisan [LMN93], a large number of
learning algorithms over the uniform (and other) distributions rely crucially on approximation by
low-degree polynomials (e.g. [KKMS08, KS08, KKM13]).

Motivated by learning of submodular functions and its application in differential privacy in
[GHRU11], Cheraghchi et al. [CKKL12] proved that every submodular function^1 can be $\epsilon$-approximated
in $\ell_2$ norm by a polynomial of degree $O(1/\epsilon^2)$. Their proof is based on the analysis of the noise
sensitivity of submodular functions, a standard tool from Fourier analysis for establishing low-degree
spectral concentration. Subsequently, Feldman et al. proved the same upper bound of $O(1/\epsilon^2)$
using approximation of submodular functions by real-valued decision trees [FKV13]. They also
gave a lower bound for learning that implies a lower bound of $\Omega(1/\epsilon^{2/3})$ on the degree necessary to
$\epsilon$-approximate submodular functions.

^1 Here and below we normalize the function range to [0, 1].

Figure 1: Overview of low-degree approximations: bounds on $(\ell_2, \epsilon)$-approximate degree for a function with
range [0,1].

Most recently, we considered the approximability of submodular and XOS functions by functions
of few variables or juntas [FV13]. We showed that submodular functions are $\epsilon$-approximated in $\ell_2$
by functions depending on $O(\frac{1}{\epsilon^2} \log \frac{1}{\epsilon})$ variables, while for XOS functions, a junta of size $2^{O(1/\epsilon^2)}$
suffices. In addition, we showed that submodular and XOS functions (in fact, all self-bounding
functions) have constant total influence, implying that they can be approximated by a polynomial
of degree $O(1/\epsilon^2)$. These results have led to substantially faster learning and testing algorithms for
these classes of functions, most notably, a -n^ time PAG learning algorithm for submodular 

functions and a 2^^^/^ ^ -n time PAG learning algorithm for XOS functions. Learning of submodular 
and XOS functions is also the main motivating application of this work. 

1.1 Our Results 

In this work, we investigate the degree that is necessary to approximate XOS and submodular
functions in detail. For a real-valued function $f : \{0,1\}^n \to \mathbb{R}$, let $\deg^{\ell_2}_\epsilon(f)$ denote the smallest
$d$ such that there is a polynomial $p$ of degree $d$ for which $\|f - p\|_2 \le \epsilon$; we refer to it as the
$(\ell_2, \epsilon)$-approximate degree of $f$. The three known upper bounds on the $(\ell_2, \epsilon)$-approximate degree of
submodular functions are all $O(1/\epsilon^2)$ [CKKL12, FKV13, FV13]. The bounds are derived via three
different approaches, suggesting that this might be the right answer. This bound also applies to
XOS and self-bounding functions [FV13] and the known lower bound of $\Omega(1/\epsilon^{2/3})$ also applies to all
of these classes of functions [FKV13]. Here we show that, in fact, the picture is substantially richer:
each of these classes requires a different degree to approximate that corresponds to the increasing
complexity of functions in these classes. We detail our bounds below and also summarize them in
Figure 1.

• For any submodular function $f : \{0,1\}^n \to [0,1]$, $\deg^{\ell_2}_\epsilon(f) = O(\log(1/\epsilon)/\epsilon^{4/5})$. This is
almost tight: we prove that even for very simple submodular functions of the form f{x) = 

for some k, degl^{f) = 

• For any XOS function $f : \{0,1\}^n \to [0,1]$, $\deg^{\ell_2}_\epsilon(f) = O(1/\epsilon)$. We also show that degree
$\Omega(1/\epsilon)$ is necessary for XOS functions.

For comparison we show that the bounds above do not hold for the more general class of
self-bounding functions (and, consequently, for functions with constant total influence). Namely, we
show that there exists a self-bounding function $f : \{0,1\}^n \to [0,1]$ such that $\deg^{\ell_2}_\epsilon(f) = \Omega(1/\epsilon^2)$.
This matches the upper bound in [FV13]. As an additional point of comparison, coverage functions,
a subclass of submodular functions, can be approximated by polynomials of exponentially smaller
$O(\log(1/\epsilon))$ degree [FK14]. At the same time, monotone functions and subadditive functions cannot
be approximated by polynomials of dimension-free degree and require $\Omega(\sqrt{n})$ and $\Omega(n)$ degree,
respectively, to approximate within a constant.

As a first application we show that the improved upper bound on $\deg^{\ell_2}_\epsilon$ of XOS functions leads
to an upper bound of $2^{O(1/\epsilon)}$ on the size of junta sufficient to approximate an XOS function within
$\ell_2$ error of $\epsilon$. This improves on the $2^{O(1/\epsilon^2)}$ upper bound and matches the lower bound of $2^{\Omega(1/\epsilon)}$ in
[FV13].

Our techniques: It is easy to verify that previous approaches to proving upper bounds on $\deg^{\ell_2}_\epsilon$
cannot lead to upper bounds stronger than $1/\epsilon^2$ even in the case of submodular functions. For
example, a bound on the total sum of squared influences $\mathrm{Inf}^2(f)$ leads to $\deg^{\ell_2}_\epsilon(f) \le \mathrm{Inf}^2(f)/\epsilon^2$.
However, $\mathrm{Inf}^2(f) = 1$ even for the monotone submodular function $f(x) = x_1$.

The first step of both of our upper bounds is a spectral concentration bound based on the
total second-degree influences. Namely, we consider the quantity $\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2$, where $\partial_{ij} f$ is a
second-degree discrete partial derivative of $f$. This quantity measures interactions between pairs of
variables. It is particularly meaningful in the setting of submodular functions, where it measures
the drop in marginal value of element $i$ due to the presence of $j$. That is, we always have $\partial_{ij} f \le 0$
for submodular functions. We prove that for XOS functions the quantity $\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2$ is at
most a constant. This leads to an upper bound of $O(1/\epsilon)$ on $\deg^{\ell_2}_\epsilon(f)$, since $\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2 \approx
16 \cdot \sum_S |S|^2 \hat{f}^2(S)$. The proof of this bound is based on a careful analysis of contributions of the
linear functions in the XOS representation. In particular, it also reveals that XOS functions satisfy
a degree-two version of the self-boundedness property: for all $x$, $\sum_{i,j : x_i = x_j = 1} (\partial_{ij} f(x))^2 \le 5 (f(x))^2$ (for
comparison, self-bounding monotone functions satisfy $\sum_{i : x_i = 1} \partial_i f(x) \le f(x)$).

The upper bound above is optimal for XOS functions. To prove this we give an embedding of
monotone DNF formulas into XOS functions. We then use the high noise sensitivity of Talagrand's
random DNF [MO02] to prove our lower bound on the low-degree spectral concentration.

For submodular functions, we use a different approach to obtain the stronger bound. We
examine how the sum of second-degree influences $\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2$ behaves when no individual
influence is too large. The technical notion of “large” that we use is the following “threshold norm”:
II^i/IIt = sup{a > 0 : Fr[\dif{x)\ > a] > a^}. We prove that at most O(ilog^) partial deriva¬ 
tives can be large in the sense that ||5i/||r > e. This result is a special case of almost-everywhere 
boundedness of almost all the partial derivatives of a submodular function that we show. Namely, 
the number of variables i for which Pr[|5i/(x)| > a] > (5 is at most 0(log(l/(5)/e). To prove 
this result we rely on the “boosting lemma” of Goemans and Vondrák [GV06], also used in our
recent work [FV13]. (We note that an equivalent statement also appeared in [KK07].) Finally,
we prove that for submodular functions with partial derivatives bounded by ||9i/||T < Cj we have 
Yllj=i ~ 0{y/e). This leads to the upper bound of 0(l/e^/®). 

As a warm-up to the upper bound for general submodular functions we also show a substantially
simpler analysis for totally symmetric submodular functions (functions invariant under
permutations of variables). In this case we avoid the logarithmic factor and get an $O(1/\epsilon^{4/5})$ upper
bound. While the exponent of $\epsilon$ in our upper bound is quite unexpected, it is actually optimal.
In particular, using direct estimation of spectral concentration we show that the simple function 
f{x) = min{| Yli=i 1} requires degree D(l/e^/®) for k = 0(l/e^/^). This function is monotone, 
totally symmetric, budget-additive and also can be viewed as a scaled rank function of a uniform 
matroid. Hence the lower bound applies to these subclasses of submodular functions as well. We
remark that the weaker lower bound of $\Omega(1/\epsilon^{2/3})$ in [FKV13] was given for the same function.

Finally, the lower bound of $\Omega(1/\epsilon^2)$ for self-bounding functions is based on an embedding of
any Boolean function into a self-bounding Boolean function over $n + \log(n) + O(1)$ variables using
the Hamming error-correcting code of distance 3.

Learning: The new structural results directly translate into improved learning algorithms using
the techniques from [FKV13, FV13]. For brevity we describe the improvements for learning from
random examples in the PAC model with $\ell_2$ error. Similar improvements can be obtained for
agnostic learning and learning with value queries (which allow the learner to ask for the value of
the function at any point). For both XOS and submodular functions we give a new lower bound
which shows that our learning algorithms are close to optimal.

Theorem 1.1 There exists an algorithm $\mathcal{A}$ that given $\epsilon > 0$ and access to random uniform
examples of a submodular XOS function $f : \{0,1\}^n \to [0,1]$, with probability at least 2/3, outputs a
function h, such that \\f — h \\2 < e. Further, A runs in time ■ r? and uses logn 

random examples. 

The best previous algorithm for this task runs in time ■ r? and uses logn random 

examples [FV13] . We complement the new learning upper bound by a nearly tight information- 
theoretic lower bound of examples (of value queries) for any PAG learning algorithm (see 

Thm. [6^ for a formal statement). 

The proof of the lower bound relies on the reduction in [FKV13]. The improved polynomial
approximation and junta size for XOS functions lead to the following improved PAC learning
algorithm.

Theorem 1.2 There exists an algorithm $\mathcal{A}$ that given $\epsilon > 0$ and access to random uniform
examples of an XOS function $f : \{0,1\}^n \to [0,1]$, with probability at least 2/3, outputs a function $h$,
such that 11/ — h \\2 < e. Further, A runs in time ■ n and uses logn random examples, 

where r = minjn, 2^/^}. 

The best previous upper bound was polynomial in for r = minjn, 2^/'^^} [FV13| . We prove 

that any PAG algorithm for XOS functions requires 2^^^/^^ examples(see Thm. 16.101 for a formal 
statement). This upper bound is close to being tight when n is subexponential in 2^/*^. The lower 
bound is based on the embedding of monotone DNF into XOS functions that we used in the lower 
bound on degf^ together with the lower bound for learning monotone DNF of Blum et al. [BBL98] . 
Finally, using the Hamming code-based embedding we mentioned above we give a stronger lower 
bound of 2^^^/*^ ^ examples for any PAG learning algorithm for learning self-bounding functions. 

Organization: Following the preliminaries we present the proofs of our main upper bounds: in
Section 3 for XOS functions and in Section 4 for submodular functions. Applications to
approximation of XOS functions by juntas and learning algorithms appear in Section 5. Details of lower
bounds on spectral concentration and learning appear in Section 6. In the appendix we prove the
equivalence of Rademacher complexity and XOS functions.

1.2 Related Work 

Analysis of functions on the Boolean hypercube has a long history with strong ties to combinatorics,
probability, learning theory, cryptography and complexity theory (see [O'D14]). One of
the fundamental and most well-studied properties of Boolean functions is monotonicity. There is
now a rich and detailed literature on the structure of the Fourier spectrum of monotone Boolean
functions and their learnability over the uniform distribution |KLV94[ ITal94[ ITal96[ IBT961 IBBL98[ 
IM()n2[ [crmil IAMOBI IDT M+OSI lOWI.Il IDSFT+15] . starting with the work of Goldreich et al 
[GGL+00], numerous works have also investigated testing of monotone functions over the Boolean
hypercube. Submodularity is closely related to monotonicity: indeed a function is submodular if
and only if its partial derivatives are monotone non-increasing. In addition, XOS functions which
are monotone share structural similarities with monotone DNF formulas (we make this explicit in
Section 6.2). Hence our work is both inspired by the research on understanding of monotonicity
over the Boolean hypercube and builds on techniques and results developed in that research. At 
the same time we are not aware of techniques closely related to those we use to prove our upper 
bounds for submodular and XOS functions having been used before. 


We now review some recent work on learning of submodular, XOS and related classes of
real-valued functions. Reconstruction of submodular functions up to some multiplicative factor (on
every point) from value queries was first considered by Goemans et al. [GHIM09]. They show a
polynomial-time algorithm for reconstructing monotone submodular functions with $O(\sqrt{n})$-factor
approximation and prove a nearly matching lower bound. This was extended to the class of all
subadditive functions in [BDF+12], which studies small-size approximate representations of valuation
functions (referred to as sketches).


Motivated by applications in economics, Balcan and Harvey initiated the study of learning
submodular functions from random examples coming from an unknown distribution and introduced
the PMAC learning model that requires a multiplicative approximation to the target function on
most of the domain [BH12]. They give an $O(\sqrt{n})$-factor PMAC learning algorithm and show an
information-theoretic n(-^^)-fact or impossibility result for submodular functions. Subsequently, 
Balcan et al. gave a distribution-independent PMAC learning algorithm for XOS functions that
achieves an $O(\sqrt{n})$-approximation and showed that this is essentially optimal [BCIW12].


Learning of submodular functions with additive rather than multiplicative guarantees over the
uniform distribution was first considered by Gupta et al., who were motivated by applications in
private data release [GHRU11]. They show that submodular functions can be $\epsilon$-approximated by a
collection of 1 e^-Lipschitz submodular functions. Concentration properties imply that each 

e^-Lipschitz submodular function can be e-approximated by a constant. This leads to a learning 
algorithm running in time ), which however requires value queries in order to build the 

collection. Using the upper bound of 0(l/e^) on degf^ of submodular functions Cheraghchi et 
al. gave a rp^^!'^ ) learning algorithm which uses only random examples and, in addition, works 
in the agnostic setting [CKKL12]. Feldman et al. show that the decomposition from [GHRU11]
can be computed by a low-rank binary decision tree [FKV13]. They then show that this decision
tree can be pruned to obtain a depth-$O(1/\epsilon^2)$ decision tree that approximates a submodular
function. This construction implies approximation by a ^-junta of degree 0(l/e^). They 

also show how approximation by a junta can be used to obtain a 2^^^/^ ^ PAG learning algorithm 
for submodular functions. Feldman et al. extend the results on noise sensitivity of submodular 
functions in [CKKLl^ to all self-bounding functions and show that they imply approximation 
within distance of e by a polynomial of 0(log(l/e)/e) degree and -junta [FKV14] . 

Note that the approximation in $\ell_2$ norm we give here is stronger, and our lower bound for self-bounding
functions implies that any approach that works for all self-bounding functions cannot improve on
the $O(1/\epsilon^2)$ bound on $\deg^{\ell_2}_\epsilon$.

Raskhodnikova and Yaroslavtsev consider learning and testing of submodular functions taking
values in the range $\{0, 1, \ldots, k\}$ (referred to as pseudo-Boolean) [RY13]. The error of a hypothesis in
their framework is the probability that the hypothesis disagrees with the unknown function. They
show that pseudo-Boolean submodular functions can be expressed as $2k$-DNF and thus obtain a
poly(n) • PAC learning algorithm using value queries. In a subsequent work, Blais 

et al. prove existence of a junta of size (/c log(l/e))^^^^ and use it to give an algorithm for testing 
submodularity using (A: log(l/e))‘^(*^) value queries |BOSY13] . 

2 Preliminaries 

Let us define submodular, fractionally subadditive and subadditive functions. These classes are well 
known in combinatorial optimization and there has been a lot of recent interest in these functions 
in algorithmic game theory, due to their expressive power as valuations of self-interested agents. 

Definition 2.1 A set function $f : 2^N \to \mathbb{R}$ is

• monotone, if $f(A) \le f(B)$ for all $A \subseteq B \subseteq N$.

• submodular, if $f(A \cup B) + f(A \cap B) \le f(A) + f(B)$ for all $A, B \subseteq N$.

• subadditive, if $f(A \cup B) \le f(A) + f(B)$ for all $A, B \subseteq N$.

• fractionally subadditive, if $f(A) \le \sum_i \beta_i f(B_i)$ whenever $\beta_i \ge 0$ and $\sum_{i : a \in B_i} \beta_i \ge 1$ for all $a \in A$.
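
For small ground sets, the first three conditions of Definition 2.1 can be checked by brute force. The following is a minimal sketch, not part of the original paper: the helper names and the two example functions (`budget_additive`, `squared_size`) are hypothetical illustrations, and sets are represented as Python frozensets over {0, ..., n-1}.

```python
from itertools import combinations

def subsets(n):
    """Yield every subset of {0, ..., n-1} as a frozenset."""
    for r in range(n + 1):
        for combo in combinations(range(n), r):
            yield frozenset(combo)

def is_monotone(f, n, tol=1e-12):
    """f(A) <= f(B) whenever A is a subset of B."""
    all_sets = list(subsets(n))
    return all(f(a) <= f(b) + tol for a in all_sets for b in all_sets if a <= b)

def is_submodular(f, n, tol=1e-12):
    """f(A | B) + f(A & B) <= f(A) + f(B) for all A, B."""
    all_sets = list(subsets(n))
    return all(f(a | b) + f(a & b) <= f(a) + f(b) + tol
               for a in all_sets for b in all_sets)

def is_subadditive(f, n, tol=1e-12):
    """f(A | B) <= f(A) + f(B) for all A, B."""
    all_sets = list(subsets(n))
    return all(f(a | b) <= f(a) + f(b) + tol for a in all_sets for b in all_sets)

# Hypothetical examples on a ground set of size 4.
def budget_additive(a):
    return min(len(a), 2) / 2.0      # capped cardinality: concave in |A|

def squared_size(a):
    return (len(a) / 4.0) ** 2       # convex in |A|, so not submodular
```

The exhaustive loops take time exponential in n, so this is only meant for sanity checks on toy instances.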

We identify functions on $\{0,1\}^n$ with set functions on $N = [n]$ in a natural way. By $\mathbf{0}$ and
$\mathbf{1}$, we denote the all-zeroes and all-ones vectors in $\{0,1\}^n$ respectively. Submodular functions are
not necessarily nonnegative, but in many applications (especially when considering multiplicative
approximations), this is a natural assumption. All our approximations are shift-invariant and
hence also apply to submodular functions with range $[-1/2, 1/2]$ (and can also be scaled in a
straightforward way). Fractionally subadditive functions are nonnegative by definition (by
considering $A = B_1$, $\beta_1 > 1$) and satisfy $f(\mathbf{0}) = 0$ (by considering $A = B_1 = \emptyset$, $\beta_1 = 0$). There
is an equivalent definition known as “XOS” or maximum of non-negative linear functions [Fei06]:
$f(x) = \max_{c \in C} \sum_{i=1}^n w_{ci} x_i$. Here, $w_{ci} \ge 0$ are nonnegative weights. This class includes all
(nonnegative) monotone submodular functions such that $f(\mathbf{0}) = 0$ (but does not contain non-monotone
functions). In the appendix we show that Rademacher complexity of a set of vectors, a powerful
and well-studied tool in statistical learning theory [KP00, BM02], gives an equivalent way to define
XOS functions. We also show that the class of monotone self-bounding functions is strictly broader
than that of XOS functions.

A broader class is that of self-bounding functions. Self-bounding functions were defined by
Boucheron, Lugosi and Massart [BLM00] and further generalized by McDiarmid and Reed [MR06]
as a unifying class of functions that enjoy strong concentration properties. Here, we define
self-bounding functions in the special case of $\{0,1\}^n$ as follows. A function $f : \{0,1\}^n \to \mathbb{R}$ is called
$a$-self-bounding, if $f$ is 1-Lipschitz and for all $x \in \{0,1\}^n$,

$$\sum_{i=1}^n (f(x) - f(x \oplus e_i))_+ \le a f(x),$$

where $x \oplus e_i$ is $x$ with $i$-th bit flipped and $(a)_+$ denotes $\max\{0, a\}$. The 1-Lipschitz condition
does not play a role in this paper, as we normalize functions to have values in the [0,1] range.
Self-bounding functions subsume fractionally subadditive functions, and 2-self-bounding functions
subsume (possibly non-monotone) submodular functions. See [FV13] for a more detailed discussion
of these classes of functions.
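
As an illustration, the self-bounding condition can be tested exhaustively on tiny examples. This is a sketch under an assumption: it takes the condition to be that the sum over i of max(0, f(x) - f(x with bit i flipped)) is at most a * f(x), which is the standard form consistent with the description above, and it ignores the 1-Lipschitz requirement. The example functions `or_fn` and `cut_fn` are hypothetical.

```python
from itertools import product

def is_self_bounding(f, n, a=1.0, tol=1e-12):
    """Check sum_i max(0, f(x) - f(x with bit i flipped)) <= a * f(x) for all x."""
    for x in product((0, 1), repeat=n):
        total_drop = 0.0
        for i in range(n):
            y = list(x)
            y[i] ^= 1                      # flip the i-th bit
            total_drop += max(0.0, f(x) - f(tuple(y)))
        if total_drop > a * f(x) + tol:
            return False
    return True

def or_fn(x):
    """Maximum of the coordinates: an XOS function with single-variable clauses."""
    return float(max(x))

def cut_fn(x):
    """Cut function of a single edge: submodular but not monotone."""
    return float(x[0] != x[1])
```

On these toy inputs the one-edge cut function fails the condition with a = 1 but satisfies it with a = 2, matching the remark that non-monotone submodular functions are 2-self-bounding.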

The $\ell_1$ and $\ell_2$-norms of $f : \{0,1\}^n \to \mathbb{R}$ are defined by $\|f\|_1 = \mathbf{E}_{x \sim U}[|f(x)|]$ and $\|f\|_2 =
(\mathbf{E}_{x \sim U}[f(x)^2])^{1/2}$, respectively, where $U$ is the uniform distribution.

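Definition 2.2 (Discrete derivatives) For $x \in \{0,1\}^n$, $b \in \{0,1\}$ and $i \in [n]$, let $x_{i \leftarrow b}$ denote
the vector in $\{0,1\}^n$ that equals $x$ with $i$-th coordinate set to $b$. For a function $f : \{0,1\}^n \to \mathbb{R}$ and
index $i \in [n]$ we define $\partial_i f(x) = f(x_{i \leftarrow 1}) - f(x_{i \leftarrow 0})$. We also define $\partial_{ij} f(x) = \partial_i \partial_j f(x)$.

A function is monotone (non-decreasing) if and only if for all $i \in [n]$ and $x \in \{0,1\}^n$, $\partial_i f(x) \ge 0$.
For a submodular function, $\partial_{ij} f(x) \le 0$, by considering the submodularity condition for
$x_{i \leftarrow 0, j \leftarrow 0}$, $x_{i \leftarrow 0, j \leftarrow 1}$, $x_{i \leftarrow 1, j \leftarrow 0}$, and $x_{i \leftarrow 1, j \leftarrow 1}$.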
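
The discrete derivatives of Definition 2.2 translate directly into code. The sketch below is illustrative only: the function names and the example `capped_sum` are hypothetical, and points of the hypercube are represented as tuples of 0/1 values.

```python
from itertools import product

def set_bit(x, i, b):
    """Return a copy of the bit tuple x with coordinate i set to b."""
    return x[:i] + (b,) + x[i + 1:]

def partial(f, i):
    """First-order discrete derivative: x -> f(x with x_i = 1) - f(x with x_i = 0)."""
    return lambda x: f(set_bit(x, i, 1)) - f(set_bit(x, i, 0))

def second_partial(f, i, j):
    """Second-order derivative, obtained by composing two first-order ones."""
    return partial(partial(f, j), i)

def capped_sum(x):
    """Hypothetical monotone submodular example: a concave function of sum(x)."""
    return min(sum(x), 2) / 2.0

n = 4
points = list(product((0, 1), repeat=n))

# Monotone: every first-order derivative is nonnegative everywhere.
monotone = all(partial(capped_sum, i)(x) >= 0 for i in range(n) for x in points)

# Submodular: every second-order derivative is nonpositive everywhere.
non_positive = all(second_partial(capped_sum, i, j)(x) <= 0
                   for i in range(n) for j in range(n) for x in points)
```

Composing a derivative with itself gives zero, because the first derivative in coordinate i no longer depends on that coordinate.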

Absolute error vs. error relative to norm: In our results, we typically assume that the
values of $f(x)$ are in a bounded interval $[0,1]$, and our goal is to learn $f$ with an additive error
of $\epsilon$. Some prior work considered an error relative to the norm of $f$, for example at most $\epsilon \|f\|_2$
[CKKL12]. In fact, it is known that for a non-negative submodular, XOS or self-bounding function
/, II/II 2 = II(||/||oo) [FeiObl IFMVOTt iFKVldj and hence this does not make much difference. If we 
scale $f(x)$ by $\frac{1}{4 \|f\|_2}$ we obtain a function with values in $[0,1]$ and learning the original function
within an additive error of $\epsilon \|f\|_2$ is equivalent to learning the scaled function within an error of
$\epsilon/4$.

Fourier Analysis: We rely on the standard Fourier transform representation of real-valued
functions over $\{0,1\}^n$ as linear combinations of parity functions. For $S \subseteq [n]$, the parity function
$\chi_S : \{0,1\}^n \to \{-1,1\}$ is defined by $\chi_S(x) = (-1)^{\sum_{i \in S} x_i}$. The Fourier expansion of $f$ is given by
$f(x) = \sum_{S \subseteq [n]} \hat{f}(S) \chi_S(x)$. The Fourier degree of $f$ is the largest $|S|$ such that $\hat{f}(S) \ne 0$. Note
that Fourier degree of $f$ is exactly the polynomial degree of $f$ when viewed over $\{-1,1\}^n$ instead
of $\{0,1\}^n$ and therefore it is also equal to the polynomial degree of $f$ over $\{0,1\}^n$.

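For degree $d$, let $W^d(f) = \sum_{S \subseteq [n], |S| = d} (\hat{f}(S))^2$ and $W^{>d}(f) = \sum_{i > d} W^i(f)$. For any function
$f$, Parseval's identity states that $\|f\|_2^2 = \sum_{S \subseteq [n]} (\hat{f}(S))^2$. This implies that the degree-$d$ polynomial
closest in $\ell_2$ distance to $f$ is precisely $p(x) = \sum_{S \subseteq [n], |S| \le d} \hat{f}(S) \chi_S(x)$ and $\|f - p\|_2 = \sqrt{W^{>d}(f)}$.
In other words, $\deg^{\ell_2}_\epsilon(f) = d$ if and only if $d$ is the smallest such that $W^{>d}(f) \le \epsilon^2$.

Observe that: $\partial_i f(x) = -2 \sum_{S \ni i} \hat{f}(S) \chi_{S \setminus \{i\}}(x)$, and $\partial_{ij} f(x) = 4 \sum_{S \ni i,j} \hat{f}(S) \chi_{S \setminus \{i,j\}}(x)$.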
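
To make the characterization of the approximate degree via tail Fourier weight concrete, here is a minimal brute-force sketch. It is not from the paper: the function names are hypothetical, the parity convention is (-1) raised to the sum of x_i over i in S, and the running time is exponential in n, so it only suits very small n.

```python
from itertools import product

def fourier_coefficients(f, n):
    """Return {S: hat f(S)} with chi_S(x) = (-1)^(sum of x_i for i in S)."""
    points = list(product((0, 1), repeat=n))
    coeffs = {}
    for mask in product((0, 1), repeat=n):
        S = frozenset(i for i in range(n) if mask[i])
        total = sum(f(x) * (-1) ** sum(x[i] for i in S) for x in points)
        coeffs[S] = total / len(points)    # expectation over uniform x
    return coeffs

def tail_weight(coeffs, d):
    """Fourier weight on coefficients of degree strictly above d."""
    return sum(c * c for S, c in coeffs.items() if len(S) > d)

def approx_degree(f, n, eps):
    """Smallest d whose tail weight above degree d is at most eps^2."""
    coeffs = fourier_coefficients(f, n)
    for d in range(n + 1):
        if tail_weight(coeffs, d) <= eps ** 2 + 1e-12:
            return d
    return n

def dictator(x):
    """Hypothetical example f(x) = x_1, i.e. 1/2 - (1/2) * chi_{1}(x)."""
    return float(x[0])
```

For the dictator example the only nonzero coefficients are on the empty set and the singleton of the first coordinate, so degree 1 is needed for small eps while degree 0 already suffices at eps = 1/2.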

3 Degree $O(1/\epsilon)$ approximation for XOS functions

In this section, we consider XOS functions $f : \{0,1\}^n \to \mathbb{R}_+$, $f(x) = \max_{c \in C} \sum_{i=1}^n w_{ci} x_i$, where
$w_{ci} \ge 0$ are nonnegative weights. We call each $c \in C$ a clause of the XOS function.

We recall that XOS functions, and more generally self-bounding functions, satisfy the following
inequality for each $x \in \{0,1\}^n$: $\sum_{i=1}^n (f(x) - f(x \oplus e_i))_+ \le f(x)$. In particular, for XOS functions
(which are monotone), this can be written as

$$\sum_{i : x_i = 1} \partial_i f(x) \le f(x). \qquad (1)$$

This leads to a bound of the form $\sum_S |S| \hat{f}^2(S) = O(\|f\|_2^2)$, which implies that degree $O(1/\epsilon^2)$ is
sufficient to approximate XOS functions within $\ell_2$-error $\epsilon$. Here, we aim to improve the degree
bound from $O(1/\epsilon^2)$ to $O(1/\epsilon)$. For this purpose, we seek a “second-degree variant” of inequality
(1), using the second-degree derivatives

$$\partial_{ij} f(x) = f(x_{i \leftarrow 1, j \leftarrow 1}) - f(x_{i \leftarrow 1, j \leftarrow 0}) - f(x_{i \leftarrow 0, j \leftarrow 1}) + f(x_{i \leftarrow 0, j \leftarrow 0}).$$

(For $i = j$, we define $\partial_{ii} f(x) = 0$.) Our plan is to use these expressions as follows.

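Lemma 3.1 For any function $f : \{0,1\}^n \to \mathbb{R}_+$ and any $1 \le k \le n$,

$$\sum_{|S| > k} \hat{f}^2(S) \le \frac{1}{16 k^2} \sum_{i,j=1}^n \|\partial_{ij} f\|_2^2.$$

Proof: For every pair $i \ne j \in [n]$, we have $\|\partial_{ij} f\|_2^2 = 16 \sum_{S \ni i,j} \hat{f}^2(S)$. Summing up over all
choices of $i \ne j$, each set $S$ appears $|S|(|S| - 1)$ times:

$$\sum_{i \ne j} \|\partial_{ij} f\|_2^2 = 16 \sum_{S \subseteq [n]} |S|(|S| - 1) \hat{f}^2(S).$$

Therefore, we obtain $\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2 \ge 16 \sum_{S \subseteq [n]} |S|(|S| - 1) \hat{f}^2(S) \ge 16 k^2 \sum_{|S| > k} \hat{f}^2(S)$. □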
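
The identity at the heart of this proof, that the sum over ordered pairs i ≠ j of the squared ℓ2 norms of the second derivatives equals 16 times the sum over S of |S|(|S| - 1) times the squared Fourier coefficient, can be checked numerically on a random function. The sketch below is an illustration only; the function name, the choice n = 4 and the random seed are arbitrary assumptions.

```python
from itertools import product
import random

def check_second_derivative_identity(n=4, seed=0):
    """Return both sides of the identity for a random f : {0,1}^n -> [0,1]."""
    rng = random.Random(seed)
    points = list(product((0, 1), repeat=n))
    f = {x: rng.random() for x in points}

    def with_bits(x, i, bi, j, bj):
        y = list(x)
        y[i], y[j] = bi, bj
        return tuple(y)

    def d_ij(x, i, j):
        return (f[with_bits(x, i, 1, j, 1)] - f[with_bits(x, i, 1, j, 0)]
                - f[with_bits(x, i, 0, j, 1)] + f[with_bits(x, i, 0, j, 0)])

    # Left side: sum over ordered pairs i != j of E_x[(d_ij f(x))^2].
    lhs = sum(sum(d_ij(x, i, j) ** 2 for x in points) / len(points)
              for i in range(n) for j in range(n) if i != j)

    # Right side: 16 * sum_S |S| (|S| - 1) * hat f(S)^2, with S as an indicator vector.
    rhs = 0.0
    for S in points:
        size = sum(S)
        coeff = sum(f[x] * (-1) ** sum(s * xi for s, xi in zip(S, x))
                    for x in points) / len(points)
        rhs += 16 * size * (size - 1) * coeff ** 2
    return lhs, rhs
```

Up to floating-point error the two returned values should coincide for any real-valued function, not only for XOS or submodular ones.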

Our goal in the following is to bound the expression $\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2$. First we prove the following.

Lemma 3.2 For an XOS function $f : \{0,1\}^n \to \mathbb{R}_+$ and any $x \in \{0,1\}^n$,

$$\sum_{i,j : x_i = x_j = 1} (\partial_{ij} f(x))^2 \le 5 (f(x))^2. \qquad (2)$$

Proof: Let $S$ denote the set of coordinates such that $x_i = 1$. Let $c \in C$ be a clause that achieves the
maximum, defining $f(x) = \sum_{j \in S} w_{cj}$ (if there are multiple such clauses, fix one arbitrarily). Fix any
$i \in S$ and define $c' \in C$ to be a clause achieving the maximum that defines $f(x_{i \leftarrow 0}) = \sum_{j \in S \setminus \{i\}} w_{c'j}$.
Fix another $j \in S$. We claim the following bounds:

$$-\min\{w_{ci} + w_{cj}, w_{c'j}\} \le \partial_{ij} f(x) \le \min\{w_{ci}, w_{cj}\}. \qquad (3)$$

First, assume that $\partial_{ij} f(x) > 0$. Since $f$ is monotone, we have $\partial_{ij} f(x) \le \min\{\partial_i f(x), \partial_j f(x)\}$.
Since $c$ is the clause defining $f(x)$, $f(x)$ cannot decrease by more than $w_{ci}$ when flipping $x_i$ from 1
to 0. Similarly, $f(x)$ cannot decrease by more than $w_{cj}$ when flipping $x_j$ from 1 to 0. Therefore,
$\partial_{ij} f(x) \le \min\{w_{ci}, w_{cj}\}$.

Second, assume that $\partial_{ij} f(x) < 0$. Here we have $\partial_{ij} f(x) \ge -\min\{\partial_i f(x_{j \leftarrow 0}), \partial_j f(x_{i \leftarrow 0})\}$. Recall
that after flipping $x_i$ from 1 to 0, $c'$ is a maximizing clause, and therefore $\partial_j f(x_{i \leftarrow 0}) = f(x_{i \leftarrow 0, j \leftarrow 1}) -
f(x_{i \leftarrow 0, j \leftarrow 0}) \le w_{c'j}$. To bound $\partial_i f(x_{j \leftarrow 0})$, we use the following (by monotonicity): $\partial_i f(x_{j \leftarrow 0}) =
f(x_{i \leftarrow 1, j \leftarrow 0}) - f(x_{i \leftarrow 0, j \leftarrow 0}) \le f(x_{i \leftarrow 1, j \leftarrow 1}) - f(x_{i \leftarrow 0, j \leftarrow 0}) \le w_{ci} + w_{cj}$, using the fact that $c$ is a
maximizing clause for $f(x)$. (We remark that although this seems like a weak bound, it could be
actually tight.) This proves (3).

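Next, we sum up over all pairs of coordinates $i, j \in S$. Note that $c \in C$ is fixed before choosing
$i, j$, and we can assume for convenience that the coordinates are ordered so that $i < j$ implies
$w_{ci} \le w_{cj}$. We have

$$\sum_{i,j \in S} (\partial_{ij} f(x))^2 = \sum_{i,j \in S : \partial_{ij} f(x) > 0} (\partial_{ij} f(x))^2 + 2 \sum_{i,j \in S : i > j, \partial_{ij} f(x) < 0} (\partial_{ij} f(x))^2$$

$$\le \sum_{i,j \in S} w_{ci} w_{cj} + 2 \sum_{i,j \in S, i > j} (w_{ci} + w_{cj}) w_{c'j}$$

$$\le \sum_{i,j \in S} w_{ci} w_{cj} + 4 \sum_{i,j \in S, i > j} w_{ci} w_{c'j}$$

$$= \Big( \sum_{i \in S} w_{ci} \Big)^2 + 4 \sum_{i \in S} \Big( w_{ci} \sum_{j \in S, j < i} w_{c'j} \Big)$$

$$\le (f(x))^2 + 4 (f(x))^2 = 5 (f(x))^2$$

since $\sum_{j \in S} w_{c'j} \le f(x)$ for every clause $c' \in C$. □

Lemma 3.3 For any XOS function $f : \{0,1\}^n \to \mathbb{R}_+$,

$$\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2 \le 20 \|f\|_2^2.$$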

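Proof: Since all norms here are over the uniform distribution, we have $\|\partial_{ij} f\|_2^2 = \frac{1}{2^n} \sum_{x \in \{0,1\}^n} (\partial_{ij} f(x))^2$.
Note that Lemma 3.2 counts only the contributions from points such that $x_i = x_j = 1$. However,
$\partial_{ij} f(x)$ does not depend on the values of $x_i$ and $x_j$. Therefore, we can write equivalently

$$\|\partial_{ij} f\|_2^2 = \frac{4}{2^n} \sum_{x \in \{0,1\}^n : x_i = x_j = 1} (\partial_{ij} f(x))^2.$$

Summing up over all $i, j$ and switching the sums, we get

$$\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2 = \frac{4}{2^n} \sum_{i,j=1}^n \sum_{x \in \{0,1\}^n : x_i = x_j = 1} (\partial_{ij} f(x))^2 = \frac{4}{2^n} \sum_{x \in \{0,1\}^n} \sum_{i,j : x_i = x_j = 1} (\partial_{ij} f(x))^2.$$

Now, we can apply Lemma 3.2 to conclude that $\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2 \le \frac{4}{2^n} \sum_{x \in \{0,1\}^n} 5 (f(x))^2 = 20 \|f\|_2^2$.
□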
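
The two XOS bounds above (the pointwise inequality of Lemma 3.2 and the averaged bound of Lemma 3.3) can be sanity-checked on a small random instance. The following is a sketch under stated assumptions: the random weight matrix, the instance size, and all names are hypothetical, and the sums run over ordered pairs (i, j) with the convention that the second derivative is zero when i = j.

```python
from itertools import product
import random

def make_xos(weights):
    """XOS function: maximum over clauses of a nonnegative linear function."""
    return lambda x: max(sum(w * xi for w, xi in zip(clause, x)) for clause in weights)

def d_ij(f, x, i, j):
    """Second-order discrete derivative, defined as 0 when i == j."""
    if i == j:
        return 0.0
    def at(bi, bj):
        y = list(x)
        y[i], y[j] = bi, bj
        return f(tuple(y))
    return at(1, 1) - at(1, 0) - at(0, 1) + at(0, 0)

def check_xos_bounds(n=5, clauses=4, seed=1):
    rng = random.Random(seed)
    weights = [[rng.random() for _ in range(n)] for _ in range(clauses)]
    f = make_xos(weights)
    points = list(product((0, 1), repeat=n))

    # Pointwise check: sum over pairs with x_i = x_j = 1 is at most 5 f(x)^2.
    pointwise_ok = True
    for x in points:
        ones = [i for i in range(n) if x[i] == 1]
        lhs = sum(d_ij(f, x, i, j) ** 2 for i in ones for j in ones)
        pointwise_ok = pointwise_ok and lhs <= 5 * f(x) ** 2 + 1e-9

    # Averaged check: sum_{i,j} ||d_ij f||_2^2 versus 20 ||f||_2^2.
    total = sum(d_ij(f, x, i, j) ** 2
                for x in points for i in range(n) for j in range(n)) / len(points)
    norm_sq = sum(f(x) ** 2 for x in points) / len(points)
    return pointwise_ok, total, 20 * norm_sq
```

A passing check on random instances is of course not a proof; it only confirms that the reconstructed statements are consistent on the examples tried.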

We can conclude as follows.

Corollary 3.4 For any XOS function $f : \{0,1\}^n \to [0,1]$, there is a polynomial $p$ of degree $O(1/\epsilon)$
such that $\|f - p\|_2 \le \epsilon$.

Proof: By Lemma 3.3 we have $\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2 \le 20$, since $\|f\|_2 \le 1$ here. Therefore, applying
Lemma 3.1, $\sum_{|S| > k} \hat{f}^2(S) \le \frac{20}{16 k^2} = \frac{5}{4 k^2}$. We choose $k = \sqrt{5}/(2\epsilon)$, which ensures that $\sum_{|S| > k} \hat{f}^2(S) \le \epsilon^2$,
and therefore the polynomial consisting of all terms up to degree $k$ approximates $f$ within $\ell_2$-error
$\epsilon$. □

4 Degree $\tilde{O}(1/\epsilon^{4/5})$ approximation for submodular functions

In this section, we show that the $O(1/\epsilon)$ degree approximation for XOS functions can be improved
to $\tilde{O}(1/\epsilon^{4/5})$ for submodular functions. Interestingly, this turns out to be the right answer for
submodular functions (ignoring logarithmic factors).

We build on the technique of bounding $\sum_{i,j} \|\partial_{ij} f\|_2^2$, which in the case of submodular functions
seems particularly appropriate since we know that $\partial_{ij} f(x) \le 0$ for every $x \in \{0,1\}^n$, which simplifies
certain expressions. However, Lemma 3.3 itself cannot be improved to a sub-constant bound:
it is easy to see that $\sum_{i,j} \|\partial_{ij} f\|_2^2$ could be at least $\|f\|_2^2$ for a submodular function (e.g.,
$f(x) = 1 - (1 - x_1)(1 - x_2)$). However, as we show below, this can happen only when some variables
have a very large influence. Our goal is to handle such variables separately and prove that under
the assumption of low influences, the quantity $\sum_{i,j} \|\partial_{ij} f\|_2^2$ cannot be large.

Once we can control the influences of individual variables (for now imagine that we can control
$\|\partial_i f\|_2$), we use the following way of bounding the sum of second partial derivatives.

Lemma 4.1 For any submodular function $f : \{0,1\}^n \to \mathbb{R}$, any coordinate $i$ and a subset of
coordinates $A$,

$$\sum_{j \in A} \|\partial_{ij} f\|_1 \le 2 \sqrt{|A|} \, \|\partial_i f\|_2.$$

Note the improvement from $2 |A| \, \|\partial_i f\|_1$ (which is trivial) to $2 \sqrt{|A|} \, \|\partial_i f\|_2$ on the right-hand side.
Proof: Since $f$ is submodular, we have $\partial_{ij} f(x) \le 0$, and

$$\sum_{j \in A} \|\partial_{ij} f\|_1 = \sum_{j \in A} \mathbf{E}_{x \sim U}[-\partial_{ij} f(x)] = 2 \sum_{j \in A} \mathbf{E}_{x \sim U}[(-1)^{x_j} \partial_i f(x)] = 2 \cdot \mathbf{E}_{x \sim U}[\partial_i f(x) g(x)] \le 2 \|\partial_i f\|_2 \|g\|_2,$$

where $g(x) = \sum_{j \in A} (-1)^{x_j}$ and we used the Cauchy-Schwarz inequality at the end. It is easy to
check that $\|g\|_2 = \sqrt{|A|}$, which proves the lemma. □
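
The inequality of Lemma 4.1, namely that the sum over j in A of the ℓ1 norms of the second derivatives is at most 2 times the square root of |A| times the ℓ2 norm of the first derivative, can be evaluated exactly on a small submodular example. This sketch is an illustration only: the weights and the example function `concave_of_modular` (a square root of a nonnegative linear function, which is submodular) are hypothetical choices.

```python
from itertools import product
import math

def set_bit(x, i, b):
    """Return a copy of the bit tuple x with coordinate i set to b."""
    return x[:i] + (b,) + x[i + 1:]

def d_i(f, x, i):
    """First-order discrete derivative at x."""
    return f(set_bit(x, i, 1)) - f(set_bit(x, i, 0))

def d_ij(f, x, i, j):
    """Second-order derivative as the change of d_i f when x_j goes from 0 to 1."""
    return d_i(f, set_bit(x, j, 1), i) - d_i(f, set_bit(x, j, 0), i)

def lemma_4_1_sides(f, n, i, A):
    """Return (sum_{j in A} ||d_ij f||_1, 2 * sqrt(|A|) * ||d_i f||_2)."""
    points = list(product((0, 1), repeat=n))
    lhs = sum(sum(abs(d_ij(f, x, i, j)) for x in points) / len(points) for j in A)
    norm2 = math.sqrt(sum(d_i(f, x, i) ** 2 for x in points) / len(points))
    return lhs, 2 * math.sqrt(len(A)) * norm2

weights = [0.5, 1.0, 1.5, 2.0, 2.5]   # hypothetical nonnegative weights

def concave_of_modular(x):
    """Square root of a normalized weighted sum: monotone submodular, range [0, 1]."""
    return math.sqrt(sum(w * xi for w, xi in zip(weights, x)) / sum(weights))
```

Calling `lemma_4_1_sides(concave_of_modular, 5, 0, [1, 2, 3, 4])` returns the two sides for coordinate 0 and the set of the remaining coordinates.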

First, let us sketch how this argument leads to an $O(1/\epsilon^{4/5})$ bound in the case of totally symmetric submodular functions, to illustrate some of the ideas employed in the general case.

Totally symmetric submodular functions. Let us assume that $f : \{0,1\}^n \to [0,1]$ is totally symmetric in the sense that $f(x)$ depends only on $\sum_{i=1}^n x_i$. Note that such a function is simply a concave function of $\sum_{i=1}^n x_i$. We observe that the influences of individual variables in this case cannot be too large.

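Lemma 4.2 For any totally symmetric submodular function $f : \{0,1\}^n \to [0,1]$ and $x \in \{0,1\}^n$ such that $\frac{n}{4} \le \sum_{i=1}^n x_i \le \frac{3n}{4}$, we have $|\partial_i f(x)| \le \frac{4}{n}$ for all $i \in [n]$.

Proof: Assume that $\partial_i f(x) > \frac{4}{n}$ (the opposite case is similar). Since the function is totally symmetric, we actually have $\partial_j f(x) > \frac{4}{n}$ for every $j \in [n]$. Also, $\sum_{i=1}^n x_i \ge \frac{n}{4}$. By submodularity, $f(x) - f(0) \ge \sum_{j : x_j = 1} \partial_j f(x) > \frac{4}{n} \sum_{i=1}^n x_i \ge 1$. This contradicts the fact that the values of $f(x)$ are in $[0,1]$. □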

To simplify the analysis, let us assume that in fact $|\partial_i f(x)| = O(\frac{1}{n})$ for all $i \in [n]$ and all $x \in \{0,1\}^n$. This can be accomplished by modifying the function in the regions where $\sum_{i=1}^n x_i < \frac{n}{4}$ or $> \frac{3n}{4}$ in such a way that $\partial_i f(x)$ is constant in each region. For example, if $t$ is the maximum value such that $\partial_i f(x') > \frac{4}{n}$ at a point $x'$ with $\sum_{i=1}^n x'_i = t$, let $F = f(x')$ at this point $x'$ (and we know that $t = \sum_{i=1}^n x'_i < \frac{n}{4}$). We can set $f(x) = F - \frac{4}{n}(t - \sum_{i=1}^n x_i)$ whenever $\sum_{i=1}^n x_i < t$. Similarly, we adjust the function for $\sum_{i=1}^n x_i > \frac{3n}{4}$. These are sets of small measure (under the uniform distribution) and so any approximation of the modified function also works well for the original function. In the following, we assume that $|\partial_i f(x)| = O(\frac{1}{n})$ everywhere. Now we can show the following bound.

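Lemma 4.3 If $|\partial_i f(x)| = O(\frac{1}{n})$ for all $i \in [n]$ and $x \in \{0,1\}^n$, then $\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2 = O(\frac{1}{\sqrt{n}})$.

Proof: Note that the assumption on partial derivatives also implies that $|\partial_{ij} f(x)| = O(\frac{1}{n})$. We estimate $\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2$ as follows:

$$\sum_{i,j=1}^n \|\partial_{ij} f\|_2^2 = O\Big(\frac{1}{n}\Big) \sum_{i,j=1}^n \|\partial_{ij} f\|_1 = O\Big(\frac{1}{\sqrt{n}}\Big) \sum_{i=1}^n \|\partial_i f\|_2$$

using Lemma 4.1 with $A = [n]$. Since we assume that $|\partial_i f(x)| = O(\frac{1}{n})$, it follows that $\|\partial_i f\|_2 = O(\frac{1}{n})$ for all $i \in [n]$, which proves the lemma. □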

Now we can apply the method of bounding the Fourier tail above a certain level using Lemma 3.1. We choose $k = 1/(\epsilon n^{1/4})$ in order to make the Fourier tail bounded by $O(\epsilon^2)$ as it should be. Finally, note that if $n < 1/\epsilon^{4/5}$, we can take trivially a polynomial of degree $n$. Therefore, the non-trivial case is when $n \ge 1/\epsilon^{4/5}$, and then we have $k = 1/(\epsilon n^{1/4}) \le 1/\epsilon^{4/5}$. This proves that degree $O(1/\epsilon^{4/5})$ is sufficient for totally symmetric submodular functions.
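The two quantitative steps of the symmetric case can be illustrated numerically. The sketch below is an assumption-laden illustration, not part of the paper's argument: it takes the symmetric profile $g(m) = \min\{1, 2m/n\}$ as an example, computes $\sum_{i \ne j} \|\partial_{ij} f\|_2^2$ exactly from binomial weights (for a symmetric $f$, $\partial_{ij} f$ is the second difference of $g$ at the weight of the remaining $n-2$ coordinates), and evaluates the degree choice $\min\{n, 1/(\epsilon n^{1/4})\}$. The function names are hypothetical.

```python
from math import comb, sqrt

def sum_sq_second_derivs(g, n):
    """Sum over ordered pairs i != j of ||d_ij f||_2^2 for f(x) = g(x_1 + ... + x_n), uniform x."""
    total = 0.0
    for w in range(n - 1):  # w = Hamming weight of the other n - 2 coordinates
        second_diff = g(w + 2) - 2 * g(w + 1) + g(w)
        total += comb(n - 2, w) / 2 ** (n - 2) * second_diff ** 2
    return n * (n - 1) * total

def symmetric_degree(eps, n):
    """Degree used in the sketch: trivially n when n is small, else k = 1/(eps * n^(1/4))."""
    return min(n, 1.0 / (eps * n ** 0.25))

for n in (16, 64, 256):
    g = lambda m, n=n: min(1.0, 2.0 * m / n)  # example profile, all derivatives are O(1/n)
    print(n, sqrt(n) * sum_sq_second_derivs(g, n))
```

For this profile the printed values stay bounded as $n$ grows, which matches the $O(1/\sqrt{n})$ rate of Lemma 4.3; other concave profiles can be substituted for `g`.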

General submodular functions. Let us turn now to the case of general submodular functions. The main complication here is that some variables can have large influences and we need to handle those separately. The most technical part of the proof is to show that there cannot be too many variables of large influence, measured in a suitable way, and that the influences decay relatively fast as we consider more variables. We denote by $\mu_p$ a product distribution on $\{0,1\}^n$ such that $\Pr_{x \sim \mu_p}[x_i = 1] = p$ for each $i \in [n]$. We prove the following.

Lemma 4.4 Let $f : \{0,1\}^n \to [0,1]$ be a submodular function, $0 < \delta, \epsilon < 1$, and let

$$J(\epsilon,\delta) = \{ i \in [n] : \Pr_{x \sim \mu_{1/2}}[|\partial_i f(x)| > \epsilon] > \delta \}.$$

Then $|J(\epsilon,\delta)| = O(\frac{1}{\epsilon} \log \frac{2}{\delta})$.

We prove this using the "boosting lemma" of [GV06] (which was already used for the purpose of approximating submodular functions by juntas in [FV13]).

Boosting Lemma. Let $\mathcal{F} \subseteq \{0,1\}^n$ be down-monotone (if $x \in \mathcal{F}$ and $y \le x$ coordinate-wise, then $y \in \mathcal{F}$). For $p \in (0,1)$, define $\sigma_p = \Pr_{x \sim \mu_p}[x \in \mathcal{F}]$. Then $\sigma_p = (1-p)^{\phi(p)}$ where $\phi(p)$ is a non-decreasing function for $p \in (0,1)$.

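Proof: [of Lemma 4.4] Let

• $J^+(\epsilon,\delta) = \{ i \in [n] : \Pr_{x \sim \mu_{1/2}}[\partial_i f(x) > \epsilon] > \delta/2 \}$.

• $J^-(\epsilon,\delta) = \{ i \in [n] : \Pr_{x \sim \mu_{1/2}}[\partial_i f(x) < -\epsilon] > \delta/2 \}$.

We have $J(\epsilon,\delta) \subseteq J^+(\epsilon,\delta) \cup J^-(\epsilon,\delta)$. Hence it is enough to bound $|J^+(\epsilon,\delta)|$; the same bound on $|J^-(\epsilon,\delta)|$ follows by considering the function $\bar{f}(x) = f(\mathbf{1} - x)$.

For each $j \in [n]$, define

$$\mathcal{F}_j^+ = \{ x \in \{0,1\}^n : \partial_j f(x) > \epsilon \}.$$

By submodularity, this set is down-monotone. By assumption, we have $\Pr_{x \sim \mu_{1/2}}[x \in \mathcal{F}_j^+] > \delta/2$ for $j \in J^+(\epsilon,\delta)$. Using the terminology of the boosting lemma, we have $\sigma_{1/2} = (1/2)^{\phi(1/2)}$ where $\phi(1/2) < \log_2(2/\delta)$. We define $q = 1 - (1/2)^{1/\log_2(2/\delta)} \le 1/2$. By the boosting lemma [GV06], we have

$$\Pr_{x \sim \mu_q}[x \in \mathcal{F}_j^+] = (1-q)^{\phi(q)} \ge (1-q)^{\log_2(2/\delta)} = \frac{1}{2}$$

for each $j \in J^+(\epsilon,\delta)$. We also have $\Pr_{x \sim \mu_q}[x_j = 1] = q$. Note that $x_j = 1$ and $x \in \mathcal{F}_j^+$ are independent events, since $x \in \mathcal{F}_j^+$ depends only on $\partial_j f(x)$ and this is independent of $x_j$. Therefore,

$$\Pr_{x \sim \mu_q}[x_j = 1 \ \&\ x \in \mathcal{F}_j^+] \ge \frac{q}{2}$$

for each $j \in J^+(\epsilon,\delta)$. Let $L(x) = \{ j : x_j = 1 \ \&\ x \in \mathcal{F}_j^+ \}$. We have

$$\mathbf{E}_{x \sim \mu_q}[|L(x)|] \ge \mathbf{E}_{x \sim \mu_q}[|\{ j \in J^+(\epsilon,\delta) : x_j = 1 \ \&\ x \in \mathcal{F}_j^+ \}|] \ge \frac{q}{2} |J^+(\epsilon,\delta)|.$$

On the other hand, denoting by $\mathbf{1}_S$ the indicator vector of $S$, for each $j \in L(x)$ we have $\partial_j f(\mathbf{1}_{L(x)}) > \epsilon$ and therefore

$$\epsilon |L(x)| < \sum_{j \in L(x)} \partial_j f(\mathbf{1}_{L(x)}) \le f(\mathbf{1}_{L(x)}) \le 1$$

where we used submodularity in the second inequality. This means that $|L(x)| < 1/\epsilon$ with probability 1. Therefore, we have $|J^+(\epsilon,\delta)| \le \frac{2}{q\epsilon}$. Recall that $q = 1 - (1/2)^{1/\log_2(2/\delta)} \ge \frac{1}{2\log_2(2/\delta)}$ (using $\delta < 1$), which means $|J^+(\epsilon,\delta)| \le \frac{4}{\epsilon} \log_2 \frac{2}{\delta}$. □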
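The arithmetic behind the choice of $q$ in the proof can be checked directly. The following is a minimal sketch (the function name `boosting_parameters` is hypothetical, and the bound $\frac{2}{q\epsilon}$ on $|J^+(\epsilon,\delta)|$ is taken from the chain of inequalities in the proof): with $m = \log_2(2/\delta)$ and $q = 1 - (1/2)^{1/m}$, one has $(1-q)^m = 1/2$, $q \le 1/2$ and $q \ge 1/(2m)$.

```python
import math

def boosting_parameters(delta):
    """Return (m, q) with m = log2(2/delta) and q = 1 - (1/2)^(1/m), for 0 < delta < 1."""
    m = math.log2(2.0 / delta)
    q = 1.0 - 0.5 ** (1.0 / m)
    return m, q

for delta in (0.5, 0.1, 1e-3, 1e-6):
    m, q = boosting_parameters(delta)
    eps = 0.05  # arbitrary example value of epsilon
    # (1 - q)^m should be 1/2; 2/(q eps) is the bound from the proof, 4m/eps its simplification.
    print(delta, (1 - q) ** m, q, 1 / (2 * m), 2 / (q * eps), 4 * m / eps)
```

The inequality $q \ge 1/(2m)$ is an instance of $1 - 2^{-y} \ge y/2$ for $y \in [0,1]$, which holds because the left side is concave in $y$ and the two sides agree at $y = 0$ and $y = 1$.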

We use Lemma 4.4 for two purposes. First, it allows us to take out a small set of variables $L$ whose derivatives can be large with large probability. Conditioned on these variables, we get an "almost $\epsilon$-Lipschitz" function, for which using Lemma 4.4 again allows us to prove an improved bound on $\sum_{i,j} \|\partial_{ij} f\|_2^2$.

We introduce the following notation (the "threshold norm"). In the following, all probabilities and expectations are over the uniform distribution ($x \sim \mathcal{U}$).

Definition 4.5 For a function $f : \{0,1\}^n \to \mathbb{R}$, we define

$$\|f\|_T = \sup\{ \alpha : \Pr_x[|f(x)| \ge \alpha] \ge \alpha^3 \}.$$

We remark that $\|f\|_T$ is not really a norm: it is not linear under scalar multiplication. In fact $\|f\|_T$ is never more than 1. The choice of $\alpha^3$ is somewhat arbitrary here. The notation $\|f\|_T$ is convenient for our proof but in general we do not attribute any significance to it. Lemma 4.4 (with $\delta = \epsilon^3$) implies the following.


Corollary 4.6 For a submodular function $f : \{0,1\}^n \to [0,1]$, the number of coordinates with $\|\partial_i f\|_T \ge \epsilon$ is at most $O(\frac{1}{\epsilon} \log \frac{1}{\epsilon})$.

We also have the following useful property (which we apply to $h = \partial_i f$).

Lemma 4.7 For any $h : \{0,1\}^n \to [-1,1]$, $\|h\|_2 \le \sqrt{2}\, \|h\|_T$.

Proof: Suppose that $\|h\|_T = \eta$; note that $0 \le \eta \le 1$. For every $\alpha > \eta$, we have by definition $\Pr[|h(x)| \ge \alpha] < \alpha^3$. Consequently

$$\|h\|_2^2 = \mathbf{E}[(h(x))^2] \le \alpha^2 \cdot \Pr[|h(x)| < \alpha] + 1 \cdot \Pr[|h(x)| \ge \alpha] < \alpha^2 + \alpha^3.$$

Since this holds for every $\alpha > \eta$, we also have $\|h\|_2^2 \le \eta^2 + \eta^3 \le 2\eta^2$. □
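For a function given by its table of values, the threshold norm is a maximum over finitely many candidates, which makes Lemma 4.7 easy to test. The sketch below rests on one assumption that should be stated loudly: it takes the exponent in Definition 4.5 to be 3, i.e. $\|h\|_T = \sup\{\alpha : \Pr[|h| \ge \alpha] \ge \alpha^3\}$, which is consistent with the bound $\eta^2 + \eta^3 \le 2\eta^2$ in the proof above. The helper names are hypothetical.

```python
import math
import random

def threshold_norm(values, power=3):
    """sup{a : Pr[|h| >= a] >= a**power} for h uniform over the given table of values."""
    abs_vals = sorted((abs(v) for v in values), reverse=True)
    n = len(abs_vals)
    best = 0.0
    for rank, v in enumerate(abs_vals, start=1):
        # At least `rank` points have |h| >= v, so any a <= min(v, (rank/n)^(1/power)) is feasible.
        best = max(best, min(v, (rank / n) ** (1.0 / power)))
    return best

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

rng = random.Random(0)
h = [rng.uniform(-1.0, 1.0) for _ in range(64)]  # a function {0,1}^6 -> [-1,1] as a value table
print(l2_norm(h), math.sqrt(2) * threshold_norm(h))
```

The supremum is attained at one of the candidates because the tail probability only changes at the values taken by $|h|$.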

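The following is our main bound on the quantity $\sum_{i,j} \|\partial_{ij} f\|_2^2$.

Lemma 4.8 Let $f : \{0,1\}^n \to [0,1]$ be a submodular function such that $\|\partial_i f\|_T \le \alpha$ for all $i \in S$. Then

$$\sum_{i,j \in S} \|\partial_{ij} f\|_2^2 = O\Big( \sqrt{\alpha}\, \log^{3/2} \frac{1}{\alpha} \Big).$$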

Proof: First, note that coordinates $i \in S$ such that $\|\partial_i f\|_T = 0$ do not contribute anything to the sum $\sum_{i,j \in S} \|\partial_{ij} f\|_2^2$. This is because if $\|\partial_i f\|_T = 0$ then $\partial_i f(x) = 0$ for all $x \in \{0,1\}^n$ and hence also $\partial_{ij} f(x) = 0$ for every other coordinate $j \in S$. Therefore, we can assume that $\|\partial_i f\|_T > 0$ for all $i \in S$.

Let us partition the coordinates as follows. For each $k \ge 0$, let

$$B_k = \{ i \in S : \|\partial_i f\|_T > 2^{-k} \alpha \}.$$

Note that by assumption, $B_0 = \emptyset$, the sets $B_k$ form a chain and each $i \in S$ belongs to $B_k$ for large enough $k$. By Corollary 4.6 the sets $B_k$ are bounded in size: $|B_k| = O(\frac{2^k}{\alpha} \log \frac{2^k}{\alpha})$. We define $A_k = B_{k+1} \setminus B_k$; clearly, each coordinate belongs to exactly one set $A_k$, $k \ge 0$, and we have $\|\partial_i f\|_T \le 2^{-k} \alpha$ for each $i \in A_k$.

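We estimate $\sum_{i,j \in S} \|\partial_{ij} f\|_2^2$ as follows. We can write

$$|\partial_{ij} f(x)| = |\partial_i f(x_{j \leftarrow 1}) - \partial_i f(x_{j \leftarrow 0})| \le |\partial_i f(x_{j \leftarrow 1})| + |\partial_i f(x_{j \leftarrow 0})|.$$

Therefore,

$$\|\partial_{ij} f\|_2^2 = \mathbf{E}_x[|\partial_{ij} f(x)|^2] \le \mathbf{E}_x[(|\partial_i f(x_{j \leftarrow 1})| + |\partial_i f(x_{j \leftarrow 0})|) \cdot |\partial_{ij} f(x)|].$$

Assuming that $i \in A_k$, we know that $\|\partial_i f\|_T \le 2^{-k}\alpha$, and hence $\Pr_x[|\partial_i f(x)| \ge 2^{1-k}\alpha] < 2^{3-3k}\alpha^3$. Therefore, we also have $\Pr_x[|\partial_i f(x_{j \leftarrow 1})| + |\partial_i f(x_{j \leftarrow 0})| \ge 2^{2-k}\alpha] < 2^{5-3k}\alpha^3$. Hence for $i \in A_k$ we can estimate

$$\|\partial_{ij} f\|_2^2 \le \mathbf{E}_x[(|\partial_i f(x_{j \leftarrow 1})| + |\partial_i f(x_{j \leftarrow 0})|) \cdot |\partial_{ij} f(x)|] \le 2^{2-k}\alpha \cdot \mathbf{E}_x[|\partial_{ij} f(x)|] + 2^{7-3k}\alpha^3$$

using the trivial bound that $|\partial_{ij} f(x)| \le 2$ in the case where $|\partial_i f(x_{j \leftarrow 1})| + |\partial_i f(x_{j \leftarrow 0})|$ is large. Overall, we obtain

$$\sum_{i,j \in S} \|\partial_{ij} f\|_2^2 \le 2 \sum_{0 \le \ell \le k} \sum_{i \in A_k} \sum_{j \in A_\ell} \|\partial_{ij} f\|_2^2 \le 2 \sum_{0 \le \ell \le k} \sum_{i \in A_k} \sum_{j \in A_\ell} \big( 2^{2-k}\alpha\, \mathbf{E}_x[|\partial_{ij} f(x)|] + 2^{7-3k}\alpha^3 \big)$$

$$= \sum_{k \ge 0} 2^{3-k}\alpha \sum_{i \in A_k} \sum_{\ell=0}^{k} \sum_{j \in A_\ell} \|\partial_{ij} f\|_1 + \sum_{0 \le \ell \le k} 2^{8-3k}\alpha^3 |A_k| |A_\ell|.$$

Here we use the bounds $|A_k| \le |B_{k+1}| = O(\frac{2^k}{\alpha} \log \frac{2^k}{\alpha})$ to estimate the second term. We get (up to constant factors) $\sum_{0 \le \ell \le k} 2^{-3k}\alpha^3 \cdot \frac{2^{k+\ell}}{\alpha^2} (k + \log\frac{1}{\alpha})(\ell + \log\frac{1}{\alpha}) \le \sum_{k \ge 0} \frac{\alpha}{2^k} (k + \log\frac{1}{\alpha})^2 = O(\alpha \log^2 \frac{1}{\alpha})$.

Hence, we get

$$\sum_{i,j \in S} \|\partial_{ij} f\|_2^2 \le \sum_{k \ge 0} 2^{3-k}\alpha \sum_{i \in A_k} \sum_{\ell=0}^{k} \sum_{j \in A_\ell} \|\partial_{ij} f\|_1 + O\Big( \alpha \log^2 \frac{1}{\alpha} \Big). \qquad (4)$$

Here we use Lemma 4.1 to estimate $\sum_{\ell=0}^{k} \sum_{j \in A_\ell} \|\partial_{ij} f\|_1$. Recall that the $A_\ell$'s are disjoint and $\bigcup_{\ell=0}^{k} A_\ell = B_{k+1}$. By Lemma 4.1,

$$\sum_{\ell=0}^{k} \sum_{j \in A_\ell} \|\partial_{ij} f\|_1 = \sum_{j \in B_{k+1}} \|\partial_{ij} f\|_1 \le 2\sqrt{|B_{k+1}|}\, \|\partial_i f\|_2 = O\Big( \sqrt{\frac{2^k}{\alpha} \log \frac{2^k}{\alpha}} \Big) \|\partial_i f\|_2.$$

Assuming $i \in A_k$, Lemma 4.7 says $\|\partial_i f\|_2 \le \sqrt{2}\, \|\partial_i f\|_T \le \sqrt{2}\, \frac{\alpha}{2^k}$. Also, $|A_k| = O(\frac{2^k}{\alpha} \log \frac{2^k}{\alpha})$, so we get

$$\sum_{i \in A_k} \sum_{\ell=0}^{k} \sum_{j \in A_\ell} \|\partial_{ij} f\|_1 = O\Big( \sqrt{\frac{2^k}{\alpha} \log \frac{2^k}{\alpha}} \Big) \cdot \sum_{i \in A_k} \|\partial_i f\|_2 = O\Big( \frac{2^{k/2}}{\sqrt{\alpha}} \log^{3/2} \frac{2^k}{\alpha} \Big).$$

Continuing the computation from equation (4), we have

$$\sum_{i,j \in S} \|\partial_{ij} f\|_2^2 \le \sum_{k \ge 0} 2^{3-k}\alpha \sum_{i \in A_k} \sum_{\ell=0}^{k} \sum_{j \in A_\ell} \|\partial_{ij} f\|_1 + O\Big( \alpha \log^2 \frac{1}{\alpha} \Big)$$

$$= O\Big( \sum_{k \ge 0} \frac{\sqrt{\alpha}}{2^{k/2}} \log^{3/2} \frac{2^k}{\alpha} \Big) + O\Big( \alpha \log^2 \frac{1}{\alpha} \Big) = O\Big( \sqrt{\alpha}\, \log^{3/2} \frac{1}{\alpha} \Big).$$

□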

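Theorem 4.9 For any submodular function $f : \{0,1\}^n \to [0,1]$, there is a polynomial $p$ of degree $O(\frac{1}{\epsilon^{4/5}} \log \frac{1}{\epsilon})$ such that $\|f - p\|_2 \le \epsilon$.

Proof: Let $\alpha = \epsilon^{4/5}$. Let $L$ be the set of variables $i \in [n]$ such that $\|\partial_i f\|_T > \alpha$. By Corollary 4.6, the number of such variables is $|L| = O(\frac{1}{\alpha} \log \frac{1}{\alpha}) = O(\frac{1}{\epsilon^{4/5}} \log \frac{1}{\epsilon})$. By Lemma 4.8 (for $S = [n] \setminus L$), we have $\sum_{i,j \notin L} \|\partial_{ij} f\|_2^2 = O(\sqrt{\alpha}\, \log^{3/2} \frac{1}{\alpha}) = O(\epsilon^{2/5} \log^{3/2} \frac{1}{\epsilon})$. On the other hand (recalling that $\|\partial_{ii} f\|_2 = 0$ and $\|\partial_{ij} f\|_2^2 = 16 \sum_{S \supseteq \{i,j\}} \hat{f}(S)^2$ for $i \ne j$),

$$\sum_{i,j \notin L} \|\partial_{ij} f\|_2^2 = 16 \sum_{S : |S \setminus L| \ge 2} |S \setminus L| (|S \setminus L| - 1)\, \hat{f}(S)^2 \ge 16 k^2 \sum_{S : |S \setminus L| > k} \hat{f}(S)^2.$$

We set $k = \frac{1}{\epsilon^{4/5}} \log \frac{1}{\epsilon}$ and obtain

$$\sum_{S : |S \setminus L| > k} \hat{f}(S)^2 \le \frac{1}{16k^2} \sum_{i,j \notin L} \|\partial_{ij} f\|_2^2 = \frac{1}{16k^2} \cdot O\Big( \epsilon^{2/5} \log^{3/2} \frac{1}{\epsilon} \Big) = O\Big( \epsilon^2 \log^{-1/2} \frac{1}{\epsilon} \Big).$$

For $\epsilon > 0$ sufficiently small, this is less than $\epsilon^2$. Therefore, the polynomial

$$p(x) = \sum_{S : |S \setminus L| \le k} \hat{f}(S) \chi_S(x)$$

satisfies

$$\|f - p\|_2^2 = \sum_{S : |S \setminus L| > k} \hat{f}(S)^2 < \epsilon^2$$

and has degree $|L| + k = O(\frac{1}{\epsilon^{4/5}} \log \frac{1}{\epsilon})$. □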


5 Applications 

5.1 Approximation of XOS functions by juntas 

We use several notions of influence of a variable on a real-valued function which are based on the standard notion of influence for Boolean functions (e.g. [BOL85, KKL88]).

Definition 5.1 (Influences) For a real-valued $f : \{0,1\}^n \to \mathbb{R}$, $i \in [n]$, and $\kappa \ge 0$ we define the $\ell_\kappa^\kappa$-influence of variable $i$ as $\mathrm{Inf}_i^\kappa(f) = \|\frac{1}{2}\partial_i f\|_\kappa^\kappa = \mathbf{E}[|\frac{1}{2}\partial_i f|^\kappa]$. We define $\mathrm{Inf}^\kappa(f) = \sum_{i \in [n]} \mathrm{Inf}_i^\kappa(f)$.

The most commonly used notion of influence for real-valued functions is the $\ell_2^2$-influence, which satisfies

$$\mathrm{Inf}_i^2(f) = \Big\| \frac{1}{2} \partial_i f \Big\|_2^2 = \sum_{S \ni i} \hat{f}(S)^2.$$

From here, the total $\ell_2^2$-influence is equal to $\mathrm{Inf}^2(f) = \sum_S |S|\, \hat{f}(S)^2$. We use the following generalization of Friedgut's theorem [Fri98] from [FV13].


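Theorem 5.2 ([FV13]) Let $f : \{0,1\}^n \to \mathbb{R}$ be any function, $\epsilon \in (0,1)$ and $\kappa \in (1,2)$. For $d$ such that $\sum_{|S| > d} \hat{f}(S)^2 \le \epsilon^2/2$ let

$$I = \{ i \in [n] \mid \mathrm{Inf}_i^\kappa(f) \ge \alpha \} \quad \text{for} \quad \alpha = \Big( (\kappa - 1)^{d-1} \cdot \epsilon^2 / (2 \cdot \mathrm{Inf}^\kappa(f)) \Big)^{\kappa/(2-\kappa)}.$$

Then for the set $\mathcal{I}_d = \{ S \subseteq I \mid |S| \le d \}$ we have $\sum_{S \notin \mathcal{I}_d} \hat{f}(S)^2 \le \epsilon^2$.

Finally, to apply this generalization we need a bound on the total influence of any XOS function: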

Lemma 5.3 Let $f : \{0,1\}^n \to \mathbb{R}_+$ be an XOS function. Then $\mathrm{Inf}^1(f) \le \|f\|_1$. In particular, for an XOS function $f : \{0,1\}^n \to [0,1]$, for all $\kappa \ge 1$, $\mathrm{Inf}^\kappa(f) \le \mathrm{Inf}^1(f) \le 1$.

Combining these results with the degree bounds from Corollary 3.4 gives the following bound:

Corollary 5.4 Let $f : \{0,1\}^n \to [0,1]$ be an XOS function and $\epsilon > 0$. There exists a $2^{O(1/\epsilon)}$-junta $p$ of Fourier degree $O(1/\epsilon)$, such that $\|f - p\|_2 \le \epsilon$. In particular, the spectral $\ell_1$-norm of $p$ is $\|\hat{p}\|_1 = \sum_{S \subseteq [n]} |\hat{p}(S)| = 2^{O(1/\epsilon^2)}$.

Proof: By Corollary 3.4 we can use $d = O(1/\epsilon)$ in Theorem 5.2 and we choose $\kappa = 4/3$. Let $\alpha = \big( (1/3)^{d-1} \cdot \epsilon^2 / (2 \cdot \mathrm{Inf}^{4/3}(f)) \big)^2$ be the lower bound on the influence of variables in the junta given in Theorem 5.2. By Lemma 5.3, $\mathrm{Inf}^{4/3}(f) \le 1$. Note that $g = \sum_{S \in \mathcal{I}_d} \hat{f}(S) \chi_S$ is a function of Fourier degree $d$ that depends only on variables in $I$. Further, $\|f - g\|_2 \le \epsilon$ and the set $I$ has size at most

$$|I| \le \mathrm{Inf}^{4/3}(f)/\alpha \le 3^{2d-2} \cdot (2/\epsilon^2)^2 = 2^{O(1/\epsilon)}.$$

□


5.2 Applications to Learning 

5.2.1 Preliminaries: Models of Learning 

We consider two models of learning based on the PAC model [Val84], which assumes that the learner has access to random examples of an unknown function from a known class of functions. Here we only consider learning over the uniform distribution over $\{0,1\}^n$ and hence simplify the definitions for this setting.

Definition 5.5 (PAC learning with $\ell_2$-error) Let $\mathcal{F}$ be a class of real-valued functions on $\{0,1\}^n$. An algorithm $\mathcal{A}$ PAC learns $\mathcal{F}$ with $\ell_2$ error over $\mathcal{U}$, if given $\epsilon > 0$, for every target function $f \in \mathcal{F}$, given access to random independent samples from $\mathcal{U}$ labeled by $f$, with probability at least 2/3, $\mathcal{A}$ returns a hypothesis $h$ such that $\|f - h\|_2 \le \epsilon$.

Definition 5.6 (Agnostic learning with $\ell_2$-error) Let $\mathcal{F}$ be a class of real-valued functions on $\{0,1\}^n$. For any distribution $\mathcal{D}$ over $\{0,1\}^n \times [0,1]$, let $\mathrm{opt}(\mathcal{D}, \mathcal{F})$ be defined as:

opt{fP,F) = - fix)y]. 

An algorithm $\mathcal{A}$ is said to agnostically learn $\mathcal{F}$ with $\ell_2$ excess error over $\mathcal{U}$ if for every $\epsilon > 0$ and any distribution $\mathcal{D}$ on $\{0,1\}^n \times [0,1]$ such that the marginal of $\mathcal{D}$ on $\{0,1\}^n$ is $\mathcal{U}$, given access to random independent examples drawn from $\mathcal{D}$, with probability at least 2/3, $\mathcal{A}$ outputs a hypothesis $h$ such that

-1)“^] < opt{V,F) + e. 

We remark that one can also define optimality with respect to labels from a different range. For simplicity we use the $[0,1]$ range since that is also the range of the functions we consider.

For both PAC and agnostic learning we will rely on the fact that polynomials of degree $d$ over $n$ variables can be learned agnostically in time polynomial in $(e \cdot n/d)^d$. For the uniform distribution this follows from the agnostic properties of the low-degree algorithm by Linial et al. [LMN93] observed by Kearns et al. [KSS94].

Theorem 5.7 Let $\mathcal{P}_d$ be the class of all degree $d$ polynomials over $n$ variables of $\ell_2$-norm at most 1. Then $\mathcal{P}_d$ can be learned agnostically over $\mathcal{U}$ with excess $\ell_2$ error of $\epsilon$ in time polynomial in $t$ and $1/\epsilon$, where $t = \sum_{i=0}^{d} \binom{n}{i} = O((e \cdot n/d)^d)$.

We remark that this result also holds over arbitrary distributions and follows from the standard uniform convergence bounds for linear models with squared loss (e.g. [KST08]).
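The low-degree approach referred to above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm as analyzed in [LMN93, KSS94]: it estimates each coefficient $\hat{f}(S)$ with $|S| \le d$ by the empirical average of $y \cdot \chi_S(x)$, using the character convention $\chi_S(x) = (-1)^{\sum_{i \in S} x_i}$; the function names are hypothetical, and the example feeds the whole cube as the sample so that the estimates are exact.

```python
import itertools
from math import comb, e

def chi(S, x):
    """Parity character chi_S(x) = (-1)^(sum of x_i for i in S)."""
    return -1 if sum(x[i] for i in S) % 2 else 1

def low_degree_fit(samples, n, d):
    """Estimate all Fourier coefficients of degree <= d from labeled samples (x, y)."""
    m = len(samples)
    coeffs = {}
    for size in range(d + 1):
        for S in itertools.combinations(range(n), size):
            coeffs[S] = sum(y * chi(S, x) for x, y in samples) / m
    return coeffs

def evaluate(coeffs, x):
    """Value of the fitted polynomial at x."""
    return sum(c * chi(S, x) for S, c in coeffs.items())

# Example target: f(x) = x_0 OR x_1 on n = 4 variables (a degree-2 polynomial).
n, d = 4, 2
target = lambda x: x[0] + x[1] - x[0] * x[1]
samples = [(x, target(x)) for x in itertools.product((0, 1), repeat=n)]
coeffs = low_degree_fit(samples, n, d)
print(len(coeffs), sum(comb(n, i) for i in range(d + 1)), (e * n / d) ** d)
```

The number of estimated coefficients is $t = \sum_{i \le d} \binom{n}{i}$, which is what drives the running time in Theorem 5.7; with random rather than exhaustive samples the estimates carry sampling error that the formal analysis has to control.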


5.2.2 PAC and Agnostic Learning of Submodular and XOS Functions 

The algorithms for PAC learning of submodular and XOS functions in [FV13] are based on two steps:

1. Identify a set of influential variables $J$ such that there exists a submodular (or XOS accordingly) function $h$ that depends only on variables in $J$ and is close to $f$.

2. Use regression over all parity functions of degree at most $d$ on variables in $J$ to find the polynomial that best fits sampled examples.

For XOS functions the first step involves simply choosing variables with large enough Fourier coefficients of degree 1. The analysis of both of these steps in [FV13] is in $\ell_2$ norm and therefore we can directly plug in our new bounds to obtain Theorem 1.2.

In the case of submodular functions in [FV13] the algorithm that finds the set of influential variables only ensures that there exists a function that depends on variables in $J$ and is close in $\ell_1$ distance to $f$. We therefore provide an analogous result for $\ell_2$. As in [FV13] our algorithm selects all variables that have a large degree-1 or 2 Fourier coefficient (with different values of thresholds). The set of variables it returns is larger but the analysis is substantially simpler than that in [FV13].

Before proceeding we will need a few simple definitions. For a real-valued $f$ over $\{0,1\}^n$ and $\epsilon \in [0,1]$ let $s_f(\epsilon)$ denote the smallest $s$ such that there exists an $s$-junta $g$ for which $\|f - g\|_2 \le \epsilon$. For a set of indices $J \subseteq [n]$ we say that a function is a $J$-junta if it depends only on variables in $J$. For a function $f$ and a set of indices $I$, we define the projection of $f$ to $I$ to be the function over $\{0,1\}^n$ whose value depends only on the variables in $I$ and its value at $x_I$ is the expectation of $f$ over all the possible values of variables outside of $I$, namely $f_I(x) = \mathbf{E}_{y \sim \mathcal{U}}[f(x_I, y_{\bar{I}})]$. Observe that an equivalent representation of $f_I$ is as follows:

$$f_I(x) = \sum_{S \subseteq I} \hat{f}(S) \chi_S(x).$$


We will also need the following bound on the number of variables with large degree-1 or degree-2 Fourier coefficient from [FV13] and the property of degree-2 Fourier coefficient of submodular functions from [FKV13].

Lemma 5.8 ([FV13]) Let $f : \{0,1\}^n \to [0,1]$ be a submodular function and $\alpha, \beta > 0$. Let

$$I = \{ i \mid |\hat{f}(\{i\})| \ge \alpha \} \cup \{ i \mid \exists j, |\hat{f}(\{i,j\})| \ge \beta \}.$$

Then $|I| \le \frac{2}{\min\{\alpha, \beta\}}$.

Lemma 5.9 ([FKV13]) Let $f : \{0,1\}^n \to [0,1]$ be a submodular function and $i, j \in [n]$, $i \ne j$. Then

$$\sum_{S \ni i,j} \hat{f}(S)^2 \le \frac{1}{2} |\hat{f}(\{i,j\})|.$$

We now state the guarantees of our algorithm for finding relevant variables.


Theorem 5.10 Let $f : \{0,1\}^n \to [0,1]$ be a submodular function. There exists an algorithm that, given any $\epsilon > 0$ and access to random and uniform examples of $f$, with probability at least 5/6, finds a set of variables $I$ of size at most $32 \cdot (s_f(\epsilon/2))^2/\epsilon^2$ such that there exists a submodular $I$-junta $h$ satisfying $\|f - h\|_2 \le \epsilon$. The algorithm runs in time $O(n^2 \log(n) \cdot (s_f(\epsilon/2))^4/\epsilon^4)$ and uses $O(\log(n) \cdot (s_f(\epsilon/2))^4/\epsilon^4)$ examples.

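Proof: Denote $s = s_f(\epsilon/2)$ and let $J$ be the set of variables such that there exists a $J$-junta $g$ such that $\|f - g\|_2 \le \epsilon/2$. We can assume without loss of generality that $g = f_J$ since $f_J$ is a submodular $J$-junta and it is the $J$-junta closest to $f$ in $\ell_2$ distance. Let

$$I' = \Big\{ i \ \Big|\ |\hat{f}(\{i\})| \ge \frac{\epsilon}{4\sqrt{s}} \Big\} \cup \Big\{ i \ \Big|\ \exists j, |\hat{f}(\{i,j\})| \ge \frac{\epsilon^2}{8 \cdot s^2} \Big\}.$$

We claim that $\|f_J - f_{I' \cap J}\|_2 \le \epsilon/2$. Note that by the triangle inequality this would imply that $\|f - f_{I' \cap J}\|_2 \le \epsilon$, meaning that it would suffice to find the variables in $I'$.

Using Lemma 5.9 and the definition of $I'$, we prove the claim as follows:

$$\|f_J - f_{I' \cap J}\|_2^2 = \sum_{S \subseteq J,\ S \not\subseteq I'} \hat{f}(S)^2 = \sum_{i \in J \setminus I'} \hat{f}(\{i\})^2 + \sum_{S \subseteq J,\ S \not\subseteq I',\ |S| \ge 2} \hat{f}(S)^2$$

$$\le s \cdot \frac{\epsilon^2}{16 \cdot s} + \sum_{i,j \in J,\ i > j,\ \{i,j\} \not\subseteq I'}\ \sum_{S \subseteq J,\ S \ni i,j} \hat{f}(S)^2 \le \frac{\epsilon^2}{16} + \sum_{i,j \in J,\ i > j,\ \{i,j\} \not\subseteq I'} \frac{1}{2} \cdot |\hat{f}(\{i,j\})|$$

$$\le \frac{\epsilon^2}{16} + \frac{s^2}{2} \cdot \frac{\epsilon^2}{8 \cdot s^2} \le \frac{\epsilon^2}{4}.$$

All we need now is to find a small set of indices $I \supseteq I'$. We simply estimate degree-1 and 2 Fourier coefficients of $f$ to accuracy $\epsilon^2/(32 \cdot s^2)$ with confidence at least 5/6 using random examples. Let $\tilde{f}(S)$ for $S \subseteq [n]$ of size 1 or 2 denote the obtained estimates. We define

$$I = \Big\{ i \ \Big|\ |\tilde{f}(\{i\})| \ge \frac{3\epsilon}{16 \cdot \sqrt{s}} \Big\} \cup \Big\{ i \ \Big|\ \exists j, |\tilde{f}(\{i,j\})| \ge \frac{3\epsilon^2}{32 \cdot s^2} \Big\}.$$

If estimates are within the desired accuracy, then clearly $I \supseteq I'$. At the same time $I \subseteq I''$, where

$$I'' = \Big\{ i \ \Big|\ |\hat{f}(\{i\})| \ge \frac{\epsilon}{8 \cdot \sqrt{s}} \Big\} \cup \Big\{ i \ \Big|\ \exists j, |\hat{f}(\{i,j\})| \ge \frac{\epsilon^2}{16 \cdot s^2} \Big\}.$$

By Lemma 5.8, $|I''| \le 32 \cdot s^2/\epsilon^2$.

Finally, to bound the running time we observe that, by the standard application of the Chernoff bound with the union bound, $O(\log(n) \cdot s^4/\epsilon^4)$ random examples are sufficient to obtain the desired estimates with confidence of 5/6. The estimation of the coefficients can be done in $O(n^2 \log(n) \cdot s^4/\epsilon^4)$ time. □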
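The selection step of the proof can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the junta size `s` is passed in as a parameter (in the theorem it is $s_f(\epsilon/2)$, which the algorithm's analysis supplies), and the thresholds $3\epsilon/(16\sqrt{s})$ and $3\epsilon^2/(32 s^2)$ are the ones used to define $I$ in the proof. The example uses the full cube as the sample so that the coefficient estimates are exact.

```python
import itertools
from math import sqrt

def chi(S, x):
    """Parity character chi_S(x) = (-1)^(sum of x_i for i in S)."""
    return -1 if sum(x[i] for i in S) % 2 else 1

def estimate_coefficient(samples, S):
    """Empirical estimate of the Fourier coefficient of S from labeled samples (x, y)."""
    return sum(y * chi(S, x) for x, y in samples) / len(samples)

def select_variables(samples, n, s, eps):
    """Keep variables with a large estimated degree-1 or degree-2 coefficient."""
    t1 = 3 * eps / (16 * sqrt(s))
    t2 = 3 * eps ** 2 / (32 * s ** 2)
    selected = set()
    for i in range(n):
        if abs(estimate_coefficient(samples, (i,))) >= t1:
            selected.add(i)
    for i, j in itertools.combinations(range(n), 2):
        if abs(estimate_coefficient(samples, (i, j))) >= t2:
            selected.update((i, j))
    return selected

# Example: f(x) = (x_0 + x_1)/2 on n = 5 variables depends only on variables 0 and 1.
n = 5
samples = [(x, (x[0] + x[1]) / 2) for x in itertools.product((0, 1), repeat=n)]
print(select_variables(samples, n, s=2, eps=0.5))
```

Estimating all $O(n^2)$ coefficients of degree at most 2 is what gives the $n^2$ factor in the running time stated in Theorem 5.10.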

We can now use the result from [FV13] that for every submodular function $f : \{0,1\}^n \to [0,1]$, $s_f(\epsilon/2) = O(\log(1/\epsilon)/\epsilon^2)$ to obtain the following corollary.

Corollary 5.11 Let $f : \{0,1\}^n \to [0,1]$ be a submodular function. There exists an algorithm that, given any $\epsilon > 0$ and access to random and uniform examples of $f$, with probability at least 5/6, finds a set of variables $I$ of size $\tilde{O}(1/\epsilon^6)$ such that there exists a submodular $I$-junta $h$ satisfying $\|f - h\|_2 \le \epsilon$. The algorithm runs in time $\tilde{O}(n^2/\epsilon^{12})$ and uses $\tilde{O}(\log(n)/\epsilon^{12})$ examples.

We use Corollary 5.11 with the least squares regression over polynomials of degree $O(\log(1/\epsilon)/\epsilon^{4/5})$ on the influential variables to obtain the learning algorithm claimed in Theorem 1.1.

Finally, for completeness we also state the corollaries for agnostic learning of XOS and submodular functions:

Theorem 5.12 Let $\mathcal{C}_s$ be the class of all submodular functions with range in $[0,1]$. There exists an algorithm that learns $\mathcal{C}_s$ agnostically with excess $\ell_2$-error $\epsilon$ and runs in time $n^{O(\log(1/\epsilon)/\epsilon^{4/5})}$.

Theorem 5.13 Let $\mathcal{C}_x$ be the class of all XOS functions with range in $[0,1]$. There exists an algorithm that learns $\mathcal{C}_x$ agnostically with excess $\ell_2$-error $\epsilon$ and runs in time $n^{O(1/\epsilon)}$.

6 Lower Bounds 

In this section we prove tight lower bounds on low-degree spectral concentration and learning of 
monotone submodular, XOS and self-bounding functions. 

6.1 Monotone Submodular Functions 

We start by showing that for any $\epsilon > 0$ there exists an explicit monotone submodular function over $O(\epsilon^{-4/5})$ variables that requires degree $\Omega(\epsilon^{-4/5})$ to $\ell_2$-approximate within $\epsilon$. The "hockey-stick" function of $k$ (out of $n$) variables is defined as follows: $\mathrm{hs}_k(x) = \min\{1, 2 \cdot W_k(x)/k\}$, where $W_k(x) = \sum_{i \le k} x_i$ is the Hamming weight of the first $k$ bits of $x$. In [FKV13] it was shown that this function has a Fourier coefficient of degree $k$ whose value is at least $\Omega(k^{-3/2})$. This immediately implies a lower bound of $\Omega(\epsilon^{-2/3})$ on $\deg_\epsilon^{\ell_2}(\mathrm{hs}_k)$ for $k = \Theta(\epsilon^{-2/3})$. We now give a more careful analysis of the low-degree spectral concentration of $\mathrm{hs}_k$ that leads to the nearly tight lower bound.

The hockey-stick function is closely related to the well-studied Boolean majority function for which tight spectral concentration bounds are known [O'D14]. Specifically, it is easy to see that for every $i$,

$$\partial_i \mathrm{hs}_k(x) = 2(1 - \mathrm{maj}_k(x))/k, \qquad (5)$$

where $\mathrm{maj}_k(x) = 1$ if $W_k(x) \ge k/2$ and 0 otherwise. This correspondence allows us to easily obtain a lower bound on the low-degree spectral concentration of $\mathrm{hs}_k(x)$.
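Lemma 6.1 For any $k \le n$ and $d \le k/2$, $W^{>d}(\mathrm{hs}_k) = \Omega(d^{-3/2}/k)$. In particular, for some constant $c_1$, $k = c_1 \epsilon^{-4/5}$ and $d = \lfloor k/2 \rfloor$ gives $W^{>d}(\mathrm{hs}_k) \ge \epsilon^2$.

Proof: We first observe that by the properties of partial derivatives given in Sec. 2 and eq. (5), for every $S \subseteq [k]$ such that $|S| \ge 2$ and $i \in S$,

$$\widehat{\mathrm{hs}_k}(S) = -\widehat{\partial_i \mathrm{hs}_k}(S \setminus i)/2 = \frac{\widehat{\mathrm{maj}_k}(S \setminus i)}{k}.$$

For $2 \le j \le k$,

$$W^j(\mathrm{hs}_k) = \sum_{S \subseteq [k],\ |S| = j} \widehat{\mathrm{hs}_k}(S)^2 = \sum_{S \subseteq [k],\ |S| = j} \frac{1}{j} \sum_{i \in S} \frac{\widehat{\mathrm{maj}_k}(S \setminus i)^2}{k^2} = \frac{k - j + 1}{k^2 j} \sum_{S \subseteq [k],\ |S| = j-1} \widehat{\mathrm{maj}_k}(S)^2 = \frac{k - j + 1}{k^2 j}\, W^{j-1}(\mathrm{maj}_k).$$

We now use the estimate $W^{j-1}(\mathrm{maj}_k) \ge c (j-1)^{-3/2}$ for some constant $c > 0$ [O'D14]. This estimate implies that

$$W^{>d}(\mathrm{hs}_k) \ge \sum_{2k/3 + 1 \ge j > d} \frac{k - j + 1}{k^2 j} \cdot c (j-1)^{-3/2} = \Omega\Big( \frac{1}{k} \sum_{2k/3 \ge j > d} j^{-5/2} \Big) = \Omega\Big( \frac{1}{k} \int_d^{2k/3} t^{-5/2}\, dt \Big) = \Omega\Big( \frac{d^{-3/2} - (2k/3)^{-3/2}}{k} \Big) = \Omega\Big( \frac{d^{-3/2}}{k} \Big),$$

where in the last step we used the condition that $d \le k/2$ and hence $d^{-3/2} - (2k/3)^{-3/2} \ge d^{-3/2}/3$. □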
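The Fourier weight of $\mathrm{hs}_k$ above a given level can be computed exactly for moderate $k$, because for a symmetric function all coefficients on the same level coincide. The sketch below is an illustration with hypothetical helper names; it assumes the character convention $\chi_S(x) = (-1)^{\sum_{i \in S} x_i}$ and takes $n = k$, so that $\mathrm{hs}_k$ depends on all variables. It uses exact rational arithmetic to avoid cancellation errors in the alternating sums.

```python
from fractions import Fraction
from math import comb

def hockey_stick_profile(k):
    """Values of hs_k as a function of the Hamming weight w = 0..k."""
    return [min(Fraction(1), Fraction(2 * w, k)) for w in range(k + 1)]

def level_weights(profile):
    """W^j(f) for j = 0..k, where f(x) = profile[x_1 + ... + x_k] under the uniform distribution."""
    k = len(profile) - 1
    weights = []
    for j in range(k + 1):
        coeff = Fraction(0)
        for w in range(k + 1):
            # Signed count of points of weight w, with sign (-1)^(ones among the first j bits).
            kraw = sum((-1) ** a * comb(j, a) * comb(k - j, w - a)
                       for a in range(max(0, w - (k - j)), min(j, w) + 1))
            coeff += profile[w] * kraw
        coeff /= 2 ** k
        weights.append(comb(k, j) * coeff ** 2)
    return weights

def tail_weight(weights, d):
    """W^{>d}(f): total Fourier weight on levels above d."""
    return sum(weights[d + 1:])

for k in (8, 16, 24):
    d = k // 2
    tail = tail_weight(level_weights(hockey_stick_profile(k)), d)
    print(k, float(tail), float(tail) * k * d ** 1.5)
```

The last printed column rescales the tail by $k \cdot d^{3/2}$, the rate in Lemma 6.1; the lemma only asserts that this rescaled quantity is bounded below by a constant, and small values of $k$ say little about asymptotics.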


We now show that any algorithm that PAC learns monotone submodular functions with $\ell_2$ error of $\epsilon$ must use $2^{\Omega(\epsilon^{-4/5})}$ examples. This result is based on a reduction from learning the class of all Boolean functions on $k$ variables with error 1/4 to the problem of learning submodular functions on $2t = k + \lceil \log k \rceil + O(1)$ variables with $\ell_2$ error of $\Theta(t^{-5/4})$. Any algorithm that learns the class of all Boolean functions on $k$ variables to accuracy 1/4 requires at least $2^{\Omega(k)}$ bits of information about the target function and, in particular, at least that many random examples or other Boolean-valued queries are necessary. The reduction is identical to the reduction in [FKV13], which proved an analogous result for learning with $\ell_1$ error of $\Theta(t^{-3/2})$. Therefore the analysis of the reduction follows closely that from [FKV13].


Lemma 6.2 For $k > 0$, let $t > 0$ be the smallest such that $\binom{2t}{t} \ge 2^k$ (and thus $4 \cdot 2^k > \binom{2t}{t} \ge 2^k$). For every Boolean function $h : \{0,1\}^k \to \{0,1\}$ there exists a monotone submodular function $f : \{0,1\}^{2t} \to [0,1]$ such that:

1. $f$ can be computed at any point $x \in \{0,1\}^{2t}$ in at most a single query to $h$ and in time $O(k)$; given a single random and uniform example of $h$, a random and uniform example of $f$ can be produced in time $O(k)$.

2. Let $\alpha = \sqrt{t} \cdot 2^k / 2^{2t} = \Theta(1)$. For any $\beta > 0$, given a function $\tilde{f} : \{0,1\}^{2t} \to \mathbb{R}$ such that $\|f - \tilde{f}\|_2 \le \frac{\sqrt{\alpha\beta}}{4 t^{5/4}}$, one can obtain a Boolean function $\tilde{h} : \{0,1\}^k \to \{0,1\}$ such that $\Pr_{\mathcal{U}}[\tilde{h}(x) \ne h(x)] \le \beta$ and $\tilde{h}$ can be computed at any point $x \in \{0,1\}^k$ with a single query to $\tilde{f}$ in time $O(k)$.


Proof: We construct $f$ by embedding $h$ into the middle layer of $\mathrm{hs} = \mathrm{hs}_{2t}$ while preserving the monotonicity and submodularity. The embedding modifies the values of $\mathrm{hs}$ by at most $1/(2t)$.

Let s = Let Mt = {x E {0,1}^* | W2t{x) = t} be the middle layer of {0,1}^* and let 

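$m : \{0,1\}^k \to M_t$ be an injective map such that both $m$ and $m^{-1}$ (whenever it exists) can be computed in time $O(k)$ at any given point (for example using lexicographic ordering on both sets). We now define $f$ as:

$$f(x) = \begin{cases} \mathrm{hs}(x) & x \notin M_t \\ 1 - \frac{1 - h(y)}{2t} & x \in M_t \text{ and } \exists y \in \{0,1\}^k,\ m(y) = x \\ 1 & \text{otherwise} \end{cases}$$

Notice that given any $x \in \{0,1\}^{2t}$, the value of $f(x)$ can be computed using a single query to $h$ and it is easy to see that given a single random and uniform example of $h$ we can output a random and uniform example of $f$ in time $O(k)$.

Given a function $\tilde{f} : \{0,1\}^{2t} \to \mathbb{R}$, define $\tilde{h} : \{0,1\}^k \to \{0,1\}$ so that $\tilde{h}(y) = 1$ if $\tilde{f}(m(y)) \ge 1 - 1/(4t)$ and $\tilde{h}(y) = 0$ otherwise. By definition,

$$h(y) \ne \tilde{h}(y) \ \Rightarrow\ |f(m(y)) - \tilde{f}(m(y))| \ge 1/(4t).$$

Using that $\Pr_{x \sim \mathcal{U}_{2t}}[\exists y,\ m(y) = x] = 2^k/2^{2t}$, we have:

$$\Pr_{y \sim \mathcal{U}_k}[h(y) \ne \tilde{h}(y)] \le \Pr_{x \sim \mathcal{U}_{2t}}[|f(x) - \tilde{f}(x)| \ge 1/(4t) \mid \exists y,\ m(y) = x]$$

$$\le (4t)^2 \cdot \mathbf{E}_{x \sim \mathcal{U}_{2t}}[|f(x) - \tilde{f}(x)|^2 \mid \exists y,\ m(y) = x]$$

$$\le (4t)^2 \cdot \frac{2^{2t}}{2^k} \cdot \mathbf{E}_{x \sim \mathcal{U}_{2t}}[|f(x) - \tilde{f}(x)|^2] = \frac{16 \cdot t^{5/2}}{\alpha} \|f - \tilde{f}\|_2^2.$$

Using $\|f - \tilde{f}\|_2 \le \frac{\sqrt{\alpha\beta}}{4 t^{5/4}}$ we have: $\Pr_{x \sim \mathcal{U}_k}[h(x) \ne \tilde{h}(x)] \le \beta$.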

Now, observe that $\mathrm{hs}$ is monotone and $f$ is obtained by modifying $\mathrm{hs}$ only on points in $M_t$ and by at most $1/(2t)$, which ensures that for any $x \le y$ such that $W_{2t}(x) < W_{2t}(y)$, $f(x) \le f(y)$. Finally, we show that $f$ is submodular for any Boolean function $h$. It will be convenient to switch notation and look at input $x$ as the indicator function of the set $S_x = \{ i \mid x_i = 1 \}$. We will verify that for each $S \subseteq [2t]$ and $i, j \notin S$,

$$f(S \cup \{i\}) - f(S) \ge f(S \cup \{i,j\}) - f(S \cup \{j\}). \qquad (6)$$

Notice that $\mathrm{hs}$ is submodular, and $f = \mathrm{hs}$ on every $x$ such that $W_{2t}(x) \ne t$. Thus, we only need to check eq. (6) for $S, i, j$ such that $|S| \in \{t-2, t-1, t\}$. We analyze these 3 cases separately:

1. $|S| = t - 1$: Notice that $f(S) = \mathrm{hs}(S) = 1 - (1/t)$ and $f(S \cup \{i,j\}) = \mathrm{hs}(S \cup \{i,j\}) = 1$. Also observe that for any $h$, $f(S \cup \{i\})$ and $f(S \cup \{j\})$ are at least $1 - \frac{1}{2t}$. Thus, $f(S \cup \{i\}) + f(S \cup \{j\}) \ge 2 - \frac{1}{t} = f(S) + f(S \cup \{i,j\})$.

2. $|S| = t - 2$: In this case, $f(S) = 1 - (2/t)$ and $f(S \cup \{i\}) = f(S \cup \{j\}) = 1 - (1/t)$. The maximum value, for any $h$, of $f(S \cup \{i,j\})$ is 1. Thus,

$$f(S) + f(S \cup \{i,j\}) \le 2 - (2/t) = f(S \cup \{i\}) + f(S \cup \{j\}).$$

3. $|S| = t$: Here, $f(S \cup \{i\}) = f(S \cup \{j\}) = f(S \cup \{i,j\}) = 1$. The maximum value of $f(S)$ for any $h$ is 1. Thus,

$$f(S) + f(S \cup \{i,j\}) \le 2 = f(S \cup \{i\}) + f(S \cup \{j\}).$$

This completes the proof that $f$ is submodular. □
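The construction can be verified exhaustively for very small $k$. The sketch below is an illustration under explicit assumptions: it takes the embedded value on the image of $m$ to be $1 - (1 - h(y))/(2t)$ (any assignment with values in $\{1 - \frac{1}{2t}, 1\}$ on the middle layer is covered by the three cases of the proof), it uses lexicographic order on both sets for the injective map $m$, and all function names are hypothetical.

```python
import itertools
from fractions import Fraction
from math import comb

def smallest_t(k):
    """Smallest t with C(2t, t) >= 2^k."""
    t = 1
    while comb(2 * t, t) < 2 ** k:
        t += 1
    return t

def embed(h, k):
    """h: dict from k-bit tuples to {0,1}. Returns (f, t) with f defined on 2t-bit tuples."""
    t = smallest_t(k)
    middle = [x for x in itertools.product((0, 1), repeat=2 * t) if sum(x) == t]
    inputs = list(itertools.product((0, 1), repeat=k))
    m_inverse = dict(zip(middle, inputs))  # zip stops after the first 2^k middle-layer points
    def f(x):
        w = sum(x)
        if w != t:
            return min(Fraction(1), Fraction(w, t))  # hs_{2t}(x)
        if x in m_inverse:
            return 1 - Fraction(1 - h[m_inverse[x]], 2 * t)
        return Fraction(1)
    return f, t

def is_monotone_submodular(f, n):
    """Check monotonicity and inequality (6) at every point and every pair of zero coordinates."""
    for x in itertools.product((0, 1), repeat=n):
        zeros = [i for i in range(n) if x[i] == 0]
        for i in zeros:
            xi = x[:i] + (1,) + x[i + 1:]
            if f(xi) < f(x):
                return False
            for j in zeros:
                if j == i:
                    continue
                xj = x[:j] + (1,) + x[j + 1:]
                xij = xi[:j] + (1,) + xi[j + 1:]
                if f(xi) - f(x) < f(xij) - f(xj):
                    return False
    return True

k = 2
for bits in itertools.product((0, 1), repeat=2 ** k):
    h = dict(zip(itertools.product((0, 1), repeat=k), bits))
    f, t = embed(h, k)
    print(bits, t, is_monotone_submodular(f, 2 * t))
```

Exact rational arithmetic is used so that the tight cases of inequality (6) are compared without rounding error.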

By choosing $\beta = 1/4$ in Lemma 6.2 we obtain the following result:

Theorem 6.3 Any algorithm that PAC learns all monotone submodular functions with range $[0,1]$ to $\ell_2$ error of $\epsilon > 0$ requires $2^{\Omega(\epsilon^{-4/5})}$ random examples of (or value queries to) the target function.


6.2 XOS functions 


The lower bounds for XOS functions are based on a simple mapping from monotone DNF (MDNF) formulas to XOS functions. We say that a function $h$ is $s$-term $t$-MDNF if $h(x) = \bigvee_{j \in [s]} T_j(x)$ where each $T_j \subseteq [k]$, $|T_j| \le t$ and $T_j(x) = \bigwedge_{i \in T_j} x_i$.

Lemma 6.4 For every $s$-term $t$-MDNF $h : \{0,1\}^k \to \{0,1\}$, let $f : \{0,1\}^k \to [0,1]$ be given by $f(x) = 1 - \frac{1 - h(x)}{t}$ if $x \ne 0$ and $f(x) = 0$ otherwise. Then $f$ is an XOS function of size $s + k$.

Proof: Let $h(x) = \bigvee_{j \in [s]} T_j(x)$ be the $s$-term $t$-MDNF representation of $h$. Then it is easy to verify that

$$f(x) = \max\Big\{ \max_{j \in [s]} \frac{\sum_{i \in T_j} x_i}{|T_j|},\ \max_{i \in [k]} \Big(1 - \frac{1}{t}\Big) x_i \Big\}.$$

□
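The max-of-linear-clauses representation can be checked by enumeration. The following sketch rests on an assumption about the garbled formula in Lemma 6.4, namely that $f(x) = 1 - (1 - h(x))/t$ for $x \ne 0$ and $f(0) = 0$; under that reading the XOS representation has one clause per term and one clause per variable. The term list and all names are hypothetical examples.

```python
import itertools
from fractions import Fraction

def mdnf(terms):
    """Monotone DNF: OR over terms of AND over the variables in the term."""
    return lambda x: int(any(all(x[i] for i in T) for T in terms))

def target(terms, t):
    """f(x) = 1 - (1 - h(x))/t for x != 0 and f(0) = 0 (assumed form of the lemma)."""
    h = mdnf(terms)
    def f(x):
        if not any(x):
            return Fraction(0)
        return 1 - Fraction(1 - h(x), t)
    return f

def xos_from_mdnf(terms, k, t):
    """Maximum of s + k non-negative linear clauses."""
    def f(x):
        term_clauses = [Fraction(sum(x[i] for i in T), len(T)) for T in terms]
        var_clauses = [(1 - Fraction(1, t)) * x[i] for i in range(k)]
        return max(term_clauses + var_clauses)
    return f

# Example: a 3-term 3-MDNF on k = 5 variables.
TERMS = [(0, 1, 2), (2, 3), (1, 4)]
k, t = 5, 3
f_direct = target(TERMS, t)
f_xos = xos_from_mdnf(TERMS, k, t)
print(all(f_direct(x) == f_xos(x) for x in itertools.product((0, 1), repeat=k)))
```

The per-variable clauses with weight $1 - 1/t$ are what lift every nonzero unsatisfying input to the same value; a term clause alone gives at most $1 - 1/|T_j| \le 1 - 1/t$ on such inputs.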

An immediate corollary of Lemma 6.4 is that for any $\beta > 0$, a function $g$ such that $\|f - g\|_2 \le \sqrt{\beta}/(2t)$ gives a function $\tilde{h}$ such that $\Pr_{\mathcal{U}}[h(x) \ne \tilde{h}(x)] \le \beta + 2^{-k}$.

To obtain our lower bounds, we rely on known results for MDNFs obtained by choosing random conjunctions of size $O(\sqrt{k})$. Such MDNFs were first analyzed by Talagrand [Tal96]. For our spectral concentration lower bound we will use the fact that Talagrand's DNFs are noise sensitive [MO02] together with a reverse connection between noise sensitivity and low-degree spectral concentration.

We first recall the definition and basic properties of the noise sensitivity. 


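Definition 6.5 (Noise sensitivity) For $\alpha \in [0,1]$, $x \in \{0,1\}^n$, we define a distribution $N_\alpha(x)$ over $y \in \{0,1\}^n$ by letting $y_i = x_i$ with probability $1 - \alpha$ and $y_i = 1 - x_i$ with probability $\alpha$, independently for each $i$. For a Boolean function $h$, the noise sensitivity of $h$ with noise rate $\alpha$ is defined as

$$\mathrm{NS}_\alpha(h) = \Pr_{x \sim \mathcal{U},\ y \sim N_\alpha(x)}[h(x) \ne h(y)].$$

Noise sensitivity satisfies (e.g. [O'D14]):

$$\mathrm{NS}_\alpha(h) = \frac{1}{2} \sum_{i=0}^{n} (1 - (1 - 2\alpha)^i) \cdot W^i(h). \qquad (7)$$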
The following theorem was proved in [MO02], following Talagrand's analysis [Tal96].

Theorem 6.6 ([MO02]) For every $k$, there exists a $\sqrt{k}$-MDNF $h$ such that $\mathrm{NS}_{1/\sqrt{k}}(h) = \Omega(1)$.

This result implies that such functions have a large Fourier mass above level $\Omega(\sqrt{k})$.

Corollary 6.7 For every $k$, there exists a $\sqrt{k}$-MDNF $h$ such that for $d = \Omega(\sqrt{k})$, $W^{>d}(h) = \Omega(1)$.

Proof: Equation (7) implies that for every $d$,

$$\mathrm{NS}_\alpha(h) = \frac{1}{2} \sum_{i=0}^{k} (1 - (1 - 2\alpha)^i) \cdot W^i(h) \le \frac{1}{2} \sum_{i=0}^{d} (1 - (1 - 2\alpha)^i) \cdot W^i(h) + \frac{1}{2} W^{>d}(h)$$

$$\le \frac{1}{2} \Big( (1 - (1 - 2\alpha)^d) \cdot \|h\|_2^2 + W^{>d}(h) \Big) \le \frac{1}{2} \Big( 2\alpha d \cdot \|h\|_2^2 + W^{>d}(h) \Big)$$

$$= \alpha d \cdot \|h\|_2^2 + W^{>d}(h)/2 \le \alpha d + W^{>d}(h)/2.$$

By Theorem 6.6 there exists a $\sqrt{k}$-MDNF $h$ such that for some constant $c > 0$, $\mathrm{NS}_{1/\sqrt{k}}(h) \ge c$. Letting $d = c\sqrt{k}/2$ we obtain that

$$W^{>d}(h) \ge 2 \Big( \mathrm{NS}_{1/\sqrt{k}}(h) - \frac{d}{\sqrt{k}} \Big) \ge c.$$

□

From here we obtain a lower bound on low-degree spectral concentration of XOS functions using Lemma 6.4.

Theorem 6.8 For every $\epsilon > 0$ there exists $k = O(1/\epsilon^2)$ and an XOS function $f : \{0,1\}^k \to [0,1]$ such that $\deg_\epsilon^{\ell_2}(f) = \Omega(1/\epsilon)$.

Proof: For $k > 0$, let $h$ be the $\sqrt{k}$-MDNF such that for $d = \Omega(\sqrt{k})$, $W^{>d}(h) = \Omega(1)$. Let $f$ be the XOS function obtained from $h$ using Lemma 6.4 (with $t = \sqrt{k}$). Then, by the linearity of Fourier coefficients and the fact that $f$ differs from $1 - \frac{1 - h}{t}$ only on a single point, we obtain that

$$W^{>d}(f) \ge W^{>d}(h)/t^2 - 2^{-k} = \Omega(1/k).$$

This means that for some $k = O(1/\epsilon^2)$ and $d = \Omega(1/\epsilon)$ we have $W^{>d}(f) \ge \epsilon^2$. □

Our lower bound for PAC learning of XOS functions is based on the following lower bound for learning MDNF by Blum et al. [BBL98].

Theorem 6.9 ([BBL98]) For any sufficiently large $k$ and $q \ge k$, any algorithm that PAC learns $t$-MDNF for $t = \log(3qk)$ over the uniform distribution and uses at most $q$ random examples (or value queries) will have error of at least $1/2 - O(\log(qk)/\sqrt{k})$.


We note that Theorem 6.9 implies a slightly weaker (by a logarithmic factor in the degree) version of Corollary 6.7 since low-degree spectral concentration implies learning (in fact, as shown in [DSFT+15], this argument also implies a lower bound on $\ell_1$-approximation by polynomials). We now prove a lower bound for PAC learning XOS functions which we state for the $\ell_1$ error (which implies the same lower bound for $\ell_2$ error).
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Theorem 6.10 Any algorithm that PAC learns all XOS functions from {0, 1}” to [0, 1] with ii 
error of e > 0 requires random examples of (or value queries to) the target function. 

Proof: We reduce learning of i-MDNF over k variables (for t and k to be chosen later) to learning 
of XOS using Lemma 16.41 namely we replace each example (x, f{x)) with 1 — —and then 
replace the hypothesis h{x) with h' such that h'[x) = 1 whenever h{x) > 1 — l/(2t). By Lemma 
16.41 any algorithm that achieves error of ^ — 2“^ gives a Boolean hypothesis for the MDNF 
problem with error of less than 1/4. 

By Theorem 6.9 there exists a constant $c > 0$ such that for $q = 2^{c\sqrt{k}}$ and $t = \log(3qk)$, the error of any PAC learning algorithm for $t$-MDNF that uses at most $q$ random examples (or value queries) is at least 1/4. Note that

$$\frac{1}{8t} - 2^{-k} = \frac{1}{8\log(3qk)} - 2^{-k} = \frac{1}{8(\log(3k) + c\sqrt{k})} - 2^{-k}$$

and therefore there exists a constant $c_1 > 0$ such that for every $\epsilon > 0$ and $k = c_1/\epsilon^2$, $\frac{1}{8t} - 2^{-k} \geq \epsilon$. Applying the guarantees of Theorem 6.9 we get that the number of random examples (or value queries) used to learn with $\ell_1$ error of $\epsilon$ must be larger than $q = 2^{c\sqrt{k}} = 2^{\Omega(1/\epsilon)}$.

□ 
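
The rounding step of this reduction can be illustrated with a small sketch. It assumes that the rescaling of Lemma 6.4 sends a Boolean label $\ell$ to $1 - (1-\ell)/t$ (an assumption here, since the lemma is not restated in this section), it uses the threshold $1 - 1/(2t)$ from the proof, and it ignores the single exceptional point responsible for the $2^{-k}$ term. All function names are hypothetical.

```python
import random

def to_xos_label(label, t):
    """Rescale a Boolean label into {1 - 1/t, 1} (assumed form of Lemma 6.4)."""
    return 1.0 - (1 - label) / t

def to_boolean_hypothesis(g, t):
    """Threshold a real-valued hypothesis g at 1 - 1/(2t), as in the proof."""
    cutoff = 1.0 - 1.0 / (2 * t)
    return lambda x: 1 if g(x) > cutoff else 0

def error_rates(points, labels, g, t):
    """Return (Boolean error of the thresholded hypothesis, l1 error of g)."""
    h_prime = to_boolean_hypothesis(g, t)
    n = len(points)
    bool_err = sum(h_prime(x) != y for x, y in zip(points, labels)) / n
    l1_err = sum(abs(g(x) - to_xos_label(y, t)) for x, y in zip(points, labels)) / n
    return bool_err, l1_err

# Toy data: arbitrary Boolean labels, and g is a noisy copy of the rescaled target.
rng = random.Random(0)
t = 5
points = list(range(1000))
labels = [rng.randint(0, 1) for _ in points]
noise = [rng.uniform(-0.3, 0.3) for _ in points]
g = lambda x: to_xos_label(labels[x], t) + noise[x]

bool_err, l1_err = error_rates(points, labels, g, t)
# Each Boolean mistake costs at least 1/(2t) in l1 error, so bool_err <= 2t * l1_err.
print(bool_err, l1_err, bool_err <= 2 * t * l1_err)
```

Under these assumptions, an $\ell_1$ error below $1/(8t)$ forces the Boolean error below 1/4, which is the Markov-style step the proof relies on.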


6.3 Self-bounding functions

We now show that upper bounds on low-degree spectral concentration that we proved for XOS and submodular functions cannot be extended to the whole class of self-bounding functions. Our construction is based on the classical Hamming code which we briefly describe here for completeness. For an integer $r$ a Hamming code is a linear mapping (over GF(2)) $c : \{0,1\}^k \to \{0,1\}^r$ such that for any two distinct $v, w \in \{0,1\}^k$ the Hamming distance between $v \circ c(v)$ and $w \circ c(w)$ is at least 3, where we use "$\circ$" to denote the concatenation of strings. We now show that for $k = 2^r - r - 1$, a Hamming code gives a way to embed any Boolean function into a self-bounding function which we describe below.

Lemma 6.11 For an integer $r$, $k = 2^r - r - 1$ and any Boolean function $h : \{0,1\}^k \to \{0,1\}$ let $f : \{0,1\}^{k+r} \to [0,1]$ be given by $f(x \circ z) = h(x)$ if $z = c(x)$ and $f(x \circ z) = 1$ otherwise. Then $f$ is a self-bounding function.

Proof: Let $x \circ z$ be a point in $\{0,1\}^{k+r}$. If $f(x \circ z) = 0$ then $f$ cannot be lower on any point that differs from $x \circ z$ in one coordinate, and therefore the self-bounding condition holds at $x \circ z$. If $f(x \circ z) = 1$ then there exists at most one point $y \in \{0,1\}^{k+r}$ that differs from $x \circ z$ in a single coordinate and $f(y) = 0$. This follows from the fact that, by definition of $f$, if $f(y) = 0$ then $y = x' \circ c(x')$ for some $x' \in \{0,1\}^k$. By the properties of $c$, any two points of this form are at Hamming distance at least 3 and therefore two distinct points cannot be at Hamming distance 1 to $x \circ z$. This means that

$$\sum_{i \in [k+r]} |f(x \circ z) - f((x \circ z) \oplus e_i)| \leq 1 = f(x \circ z).$$

□
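
The embedding can be checked by brute force for a small parameter. The sketch below builds a systematic Hamming map for $r = 3$ (so $k = 4$) and tests the embedding of the parity function. It assumes the standard definition of a self-bounding function (every single-coordinate drop $f(x) - \min_{x_i} f(x)$ is at most 1 and the drops sum to at most $f(x)$), which is not restated in this section. The helper names and the particular choice of parity-check columns are illustrative assumptions.

```python
from itertools import product

def hamming_map(r):
    """Return (k, c) where c maps k = 2^r - r - 1 data bits to r parity bits."""
    # Parity-check columns for the data bits: nonzero r-bit values that are
    # not powers of two (the powers of two belong to the parity bits).
    cols = [v for v in range(1, 2 ** r) if v & (v - 1)]

    def c(x):
        acc = 0
        for bit, col in zip(x, cols):
            if bit:
                acc ^= col
        return tuple((acc >> j) & 1 for j in range(r))

    return len(cols), c

def embed(h, c, k):
    """f(x o z) = h(x) if z = c(x), and 1 otherwise, as in Lemma 6.11."""
    return lambda p: h(p[:k]) if p[k:] == c(p[:k]) else 1

def is_self_bounding(f, n):
    """Brute-force check of the (assumed) self-bounding condition on {0,1}^n."""
    for p in product((0, 1), repeat=n):
        drops = []
        for i in range(n):
            q = p[:i] + (1 - p[i],) + p[i + 1:]  # flip coordinate i
            drops.append(f(p) - min(f(p), f(q)))
        if max(drops) > 1 or sum(drops) > f(p):
            return False
    return True

r = 3
k, c = hamming_map(r)
parity = lambda x: sum(x) % 2
f = embed(parity, c, k)
print(k, is_self_bounding(f, k + r))
```

The same check can be run with any other Boolean `h` in place of `parity`, since the lemma does not depend on the embedded function.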

A spectral concentration bound can be obtained by analyzing the embedding of a $\{0,1\}$-parity function $h(x) = \sum_{i \in [k]} x_i \bmod 2$. To avoid the direct calculation, which requires using additional properties of the Hamming code, we will derive the lower bound via lower bounds for learning below.

Theorem 6.12 Any algorithm that PAC learns all self-bounding functions from $\{0,1\}^n$ to $[0,1]$ with $\ell_2$ error of $\epsilon > 0$ requires $2^{\Omega(1/\epsilon^2)}$ random examples of (or value queries to) the target function.

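Proof: We reduce learning of all Boolean functions on $k = 2^r - r - 1$ variables (for $r$ to be chosen later) over the uniform distribution to learning of self-bounding functions using Lemma 6.11. Namely, given a random and uniform example $(x, \ell)$ of some unknown Boolean target function $h$ we output a random example $(x \circ z, \ell')$ of the function $f$ that is equal to the embedding of $h$ given by Lemma 6.11. This is done by choosing $z$ uniformly from $\{0,1\}^r$ and setting $\ell' = \ell$ if $z = c(x)$ and $\ell' = 1$ otherwise (a value query can be answered similarly using a single value query to $h$). Given a hypothesis $\tilde{f}$ we define $\tilde{h}(x) = 1$ if $\tilde{f}(x \circ c(x)) > 1/2$ and $\tilde{h}(x) = 0$ otherwise. Observe that

$$\begin{aligned}
\Pr_{x \sim U_k}[\tilde{h}(x) \neq h(x)] &\leq \Pr_{x \circ z \sim U_{k+r}}\left[|f(x \circ z) - \tilde{f}(x \circ z)| \geq 1/2 \mid c(x) = z\right] \\
&\leq 4 \cdot \mathbf{E}_{x \circ z \sim U_{k+r}}\left[|f(x \circ z) - \tilde{f}(x \circ z)|^2 \mid c(x) = z\right] \\
&\leq 4 \cdot 2^r \cdot \mathbf{E}_{x \circ z \sim U_{k+r}}\left[|f(x \circ z) - \tilde{f}(x \circ z)|^2\right] = 2^{r+2} \cdot \|f - \tilde{f}\|_2^2 .
\end{aligned}$$

The last inequality holds because the event $c(x) = z$ has probability $2^{-r}$. We now let $r = \lfloor \log(1/\epsilon^2) \rfloor - 4$, so that $2^{r+2} \epsilon^2 \leq 1/4$. This choice ensures that if $\tilde{f}$ has $\ell_2$ error of less than $\epsilon$ then $\tilde{h}$ has error of less than 1/4. Learning all Boolean functions to error of at most 1/4 requires $2^{\Omega(k)} = 2^{\Omega(2^r)} = 2^{\Omega(1/\epsilon^2)}$ random examples (or value queries) and therefore we obtain our claim. □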
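
The two directions of this reduction (simulating examples of the embedded function and rounding a hypothesis back) can be sketched in a few lines. To keep the sketch tiny it uses the smallest case $r = 2$, $k = 1$, where the Hamming map simply repeats the data bit twice. The function names and the toy target are illustrative assumptions, not part of the original argument.

```python
import random

def hamming_r2(x):
    """Hamming map for r = 2 (k = 1): the single data bit is repeated twice."""
    return (x[0], x[0])

def simulate_example(x, label, c, r, rng):
    """Map an example (x, label) of h to an example (x o z, label') of the embedded f."""
    z = tuple(rng.randint(0, 1) for _ in range(r))
    new_label = label if z == c(x) else 1
    return x + z, new_label

def boolean_hypothesis(f_tilde, c):
    """Round a real-valued hypothesis for f on inputs x o c(x) to a hypothesis for h."""
    return lambda x: 1 if f_tilde(x + c(x)) > 0.5 else 0

h = lambda x: x[0]  # toy Boolean target on k = 1 variable
f = lambda p: h(p[:1]) if p[1:] == hamming_r2(p[:1]) else 1  # its embedding

rng = random.Random(1)
inputs = [(rng.randint(0, 1),) for _ in range(200)]
samples = [simulate_example(x, h(x), hamming_r2, 2, rng) for x in inputs]
consistent = all(label == f(p) for p, label in samples)

f_tilde = lambda p: 0.8 * f(p) + 0.1  # a hypothesis that stays close to f
h_tilde = boolean_hypothesis(f_tilde, hamming_r2)
print(consistent, [h_tilde((b,)) == h((b,)) for b in (0, 1)])
```

Each simulated example costs one example of $h$, so a learner for self-bounding functions that uses few examples would yield a learner for arbitrary Boolean functions with the same sample size.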

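We now observe that there exists some constant $c$ such that $\deg_\epsilon^{\ell_2}(f) \geq c/\epsilon^2$ for some self-bounding function $f$. Otherwise, for any constant $c_0$, every self-bounding function would have degree below $c_0/\epsilon^2$, and using Theorem 5.7 we could obtain an algorithm that learns self-bounding functions using too few random examples, contradicting Theorem 6.12.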

Theorem 6.13 For every $\epsilon > 0$ there exists $k = O(1/\epsilon^2)$ and a self-bounding function $f : \{0,1\}^k \to [0,1]$ such that $\deg_\epsilon^{\ell_2}(f) = \Omega(1/\epsilon^2)$.

Acknowledgements We would like to thank Pravesh Kothari for useful discussions and his help with the proof of Lemma 6.11.

A Rademacher complexity, XOS and self-bounding functions


The Rademacher complexity of a class of functions $\mathcal{F}$ is one of the most popular and powerful tools in statistical learning theory for proving uniform convergence bounds on the generalization error [KP00, BM02]. Specifically, for a possibly unknown distribution $\mathcal{D}$ over some domain $X$ we would like to upper bound the value of $n$ for which

$$\sup_{f \in \mathcal{F}} \left| \mathbf{E}_{x \sim \mathcal{D}}[f(x)] - \frac{1}{n} \sum_{i \in [n]} f(x^i) \right| \geq \epsilon .$$


Such bounds imply learnability via empirical loss minimization and can be obtained by considering the Rademacher complexity of $\mathcal{F}$ relative to $\mathcal{D}$ which is defined as follows: for a (multi-)set $S = (x^1, \ldots, x^n)$ of $n$ points from $X$ let the empirical Rademacher complexity be defined as

$$\mathcal{R}(\mathcal{F} \circ S) = \frac{1}{n} \mathbf{E}_{\sigma \sim \{-1,1\}^n}\left[ \sup_{f \in \mathcal{F}} \sum_{i \in [n]} \sigma_i f(x^i) \right],$$

where $\sigma$ is distributed uniformly over $\{-1,1\}^n$, or equivalently each $\sigma_i$ is an independent Rademacher variable. More generally, Rademacher complexity of any bounded set of vectors $V \subseteq \mathbb{R}^n$ is defined as

$$\mathcal{R}(V) = \frac{1}{n} \mathbf{E}_{\sigma \sim \{-1,1\}^n}\left[ \sup_{v \in V} \sum_{i \in [n]} \sigma_i v_i \right].$$


The Rademacher complexity of $\mathcal{F}$ over $\mathcal{D}$ for sample size $n$ is then $\mathcal{R}_n(\mathcal{F}, \mathcal{D}) = \mathbf{E}_{S \sim \mathcal{D}^n}[\mathcal{R}(\mathcal{F} \circ S)]$. To study the concentration properties of empirical Rademacher complexity it is viewed as a function over subsets of $[n]$ defined as

$$\mathcal{R}(\mathcal{F} \circ S, A) = \frac{1}{n} \mathbf{E}_{\sigma \sim \{-1,1\}^n}\left[ \sup_{f \in \mathcal{F}} \sum_{i \in A} \sigma_i f(x^i) \right],$$

in other words it measures the Rademacher complexity of $S$ restricted to points with indices in $A$. This function is known to be self-bounding, an essential property for the applications of Rademacher complexity that rely on strong concentration of measure results (e.g. [BLB03]). Here we show that Rademacher complexity of any set of vectors $V$ is in fact an XOS function. For completeness, in Section A.2 we show that this is a strictly smaller class than that of monotone self-bounding functions.


A.1 Equivalence of XOS and Rademacher complexity functions

For convenience we remove the normalizing factor $\frac{1}{n}$ in the definition of the Rademacher complexity since it does not affect the membership of a function in XOS.

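Theorem A.1 Let $V$ be a bounded set of vectors from $\mathbb{R}^n$. Then the function $\mathcal{R}(V, \cdot) : 2^{[n]} \to \mathbb{R}$ defined as

$$\mathcal{R}(V, A) = \frac{1}{n} \mathbf{E}_{\sigma \sim \{-1,1\}^n}\left[ \sup_{v \in V} \sum_{i \in A} \sigma_i v_i \right]$$

is XOS.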

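Proof: For convenience we prove that $\phi(A) = n \cdot \mathcal{R}(V, A)$ is XOS which naturally implies that $\mathcal{R}(V, A)$ is XOS. We first observe that $\phi(\emptyset) = 0$. Next we show that $\phi$ is monotone. For simplicity we assume that $V$ is compact (the extension to general sets is straightforward). For $A$ and a vector $\sigma \in \{-1,1\}^A$ let $v^{A,\sigma} \in V$ be a vector such that $\sum_{i \in A} \sigma_i v^{A,\sigma}_i = \max_{v \in V} \sum_{i \in A} \sigma_i v_i$. For subsets $A \subseteq A' \subseteq [n]$ and $\sigma' \in \{-1,1\}^{A'}$ we denote by $\sigma'_A$ the vector containing the bits of $\sigma'$ with indices in $A$. Then

$$\phi(A') = \mathbf{E}_{\sigma'}\Big[\sum_{i \in A'} \sigma'_i v^{A',\sigma'}_i\Big] \geq \mathbf{E}_{\sigma'}\Big[\sum_{i \in A'} \sigma'_i v^{A,\sigma'_A}_i\Big] = \mathbf{E}_{\sigma'}\Big[\sum_{i \in A} \sigma'_i v^{A,\sigma'_A}_i\Big] = \phi(A),$$

where we used the fact that for $i \in A' \setminus A$, $\sigma'_i$ is independent of $v^{A,\sigma'_A}$.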

The function $\phi$ has non-negative range and $\phi(\emptyset) = 0$. Therefore it is sufficient to prove that $\phi$ is fractionally subadditive. That is, we need to prove that $\phi(A) \leq \sum_{j \in [m]} \beta_j \phi(B_j)$ whenever $\beta_j \geq 0$ and $\sum_{j : i \in B_j} \beta_j \geq 1$ $\forall i \in A$. Monotonicity of $\phi$ implies that it is sufficient to prove this condition for exact fractional covers: that is, $B_j \subseteq A$ and $\sum_{j : i \in B_j} \beta_j = 1$ $\forall i \in A$. This condition implies that for every vector $w \in \mathbb{R}^n$,

$$\sum_{j \in [m]} \sum_{i \in B_j} \beta_j w_i = \sum_{i \in A} w_i .$$
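
Using this equality with $w_i = \sigma_i v_i$ we can conclude:

$$\phi(A) = \mathbf{E}_{\sigma}\Big[\sup_{v \in V} \sum_{i \in A} \sigma_i v_i\Big] = \mathbf{E}_{\sigma}\Big[\sup_{v \in V} \sum_{j \in [m]} \beta_j \sum_{i \in B_j} \sigma_i v_i\Big] \leq \sum_{j \in [m]} \beta_j \mathbf{E}_{\sigma}\Big[\sup_{v \in V} \sum_{i \in B_j} \sigma_i v_i\Big] = \sum_{j \in [m]} \beta_j \phi(B_j).$$

□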
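
Theorem A.1 can be sanity-checked numerically on a small instance. The sketch below computes the unnormalized quantity $\mathbf{E}_\sigma[\max_{v \in V} \sum_{i \in A} \sigma_i v_i]$ exactly by enumerating all sign vectors, then tests monotonicity and subadditivity (both necessary for XOS). The vectors in `V` are arbitrary illustrative values and the helper names are hypothetical. Indices are 0-based in the code.

```python
from itertools import product, combinations

def phi(V, A):
    """Unnormalized Rademacher complexity: E_sigma[max_{v in V} sum_{i in A} sigma_i v_i]."""
    n = len(V[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        total += max(sum(sigma[i] * v[i] for i in A) for v in V)
    return total / 2 ** n

def subsets(n):
    """All subsets of {0, ..., n-1} as frozensets."""
    return [frozenset(s) for m in range(n + 1) for s in combinations(range(n), m)]

def is_monotone(f, n):
    return all(f[a] <= f[b] + 1e-12 for a in subsets(n) for b in subsets(n) if a <= b)

def is_subadditive(f, n):
    return all(f[a | b] <= f[a] + f[b] + 1e-12 for a in subsets(n) for b in subsets(n))

n = 3
V = [(1.0, -2.0, 0.5), (0.0, 3.0, -1.0), (-1.5, 0.5, 2.0)]  # arbitrary example set
table = {A: phi(V, A) for A in subsets(n)}
print(table[frozenset()], is_monotone(table, n), is_subadditive(table, n))
```

A fractional cover can be tested the same way, for example by comparing `table` on the full set with half the sum of its values on the three pairs.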

We remark that this proof also applies to Gaussian complexity of a set of vectors V, another 
measure of complexity studied in convex geometry and statistical learning theory. In this measure 
in place of a Rademacher variable, a 0-mean Gaussian with variance 1 is used (the only fact about 
$\sigma_i$'s that we used is that it is 0-mean and independent of all other variables).

It turns out that the converse of Theorem A.1 is also true. Any XOS function can be represented as Rademacher complexity of some set of vectors.


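Theorem A.2 Let $f : 2^{[n]} \to \mathbb{R}$ be an XOS function. Then there exists a set $V$ such that for every set $A \subseteq [n]$,

$$f(A) = \frac{1}{n} \mathbf{E}_{\sigma \sim \{-1,1\}^n}\left[ \max_{v \in V} \sum_{i \in A} \sigma_i v_i \right].$$

Proof: By definition of XOS, there exists a set of clauses $\mathcal{C}$ such that $f(A) = \max_{C \in \mathcal{C}} \sum_{i \in A} w_{Ci}$ for some non-negative weights $w_{Ci}$. Let

$$V = \{ n \cdot (w_{C1}\sigma_1, w_{C2}\sigma_2, \ldots, w_{Cn}\sigma_n) \mid C \in \mathcal{C},\ \sigma \in \{-1,1\}^n \}.$$

Then for every $A$ and $\sigma$,

$$\max_{v \in V} \sum_{i \in A} \sigma_i v_i = n \cdot \max_{C \in \mathcal{C}} \max_{\tau \in \{-1,1\}^n} \sum_{i \in A} \sigma_i \tau_i w_{Ci} = n \cdot \max_{C \in \mathcal{C}} \sum_{i \in A} w_{Ci} = n \cdot f(A).$$

This implies that $\frac{1}{n} \mathbf{E}_{\sigma}\left[\max_{v \in V} \sum_{i \in A} \sigma_i v_i\right] = f(A)$. □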
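
The construction in this proof can be checked on a toy XOS function. The sketch assumes the normalized form of the Rademacher complexity (with the factor $1/n$) and, correspondingly, vectors scaled by $n$; dropping both gives the same identity. The clause weights and function names are illustrative assumptions, and indices are 0-based.

```python
from itertools import product

def xos_value(clauses, A):
    """f(A) = max over clauses C of sum_{i in A} w_{C,i}."""
    return max(sum(w[i] for i in A) for w in clauses)

def build_vectors(clauses):
    """V = { n * (w_{C,1} s_1, ..., w_{C,n} s_n) : C a clause, s in {-1,1}^n }."""
    n = len(clauses[0])
    return [tuple(n * w[i] * s[i] for i in range(n))
            for w in clauses for s in product((-1, 1), repeat=n)]

def rademacher(V, A):
    """(1/n) * E_sigma[max_{v in V} sum_{i in A} sigma_i v_i], by full enumeration."""
    n = len(V[0])
    total = sum(max(sum(sig[i] * v[i] for i in A) for v in V)
                for sig in product((-1, 1), repeat=n))
    return total / (n * 2 ** n)

# Three non-negative clauses on n = 3 elements (arbitrary example weights).
clauses = [(0.5, 0.0, 0.25), (0.0, 0.75, 0.0), (0.25, 0.25, 0.25)]
V = build_vectors(clauses)
for A in [(), (0,), (1, 2), (0, 1, 2)]:
    print(A, xos_value(clauses, A), rademacher(V, A))
```

For every sign vector the maximizing element of `V` copies the signs on `A`, which is why the inner maximum does not depend on the signs and the expectation collapses to $f(A)$.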

A.2 Separation of XOS and monotone self-bounding functions

Here we show a simple monotone self-bounding function which is not XOS. We remark that formally, such a separation is trivial since XOS functions must satisfy $f(0) = 0$, unlike monotone self-bounding functions. Here we present a more interesting example, a function $f : \{0,1\}^3 \to \mathbb{R}_+$ such that $f$ is 1-Lipschitz monotone self-bounding, $f(0) = 0$ and $f$ is not XOS. The function is defined as follows (in set notation):

• $f(\emptyset) = 0$

• $f(\{1\}) = 1/5$, $f(\{2\}) = 2/5$, $f(\{3\}) = 3/5$

• $f(\{1,2\}) = 3/5$, $f(\{1,3\}) = 4/5$, $f(\{2,3\}) = 3/5$

• $f(\{1,2,3\}) = 1$

The reader can verify that this function is monotone self-bounding but not XOS (in fact not even subadditive, since $f(\{1\}) + f(\{2,3\}) < f(\{1,2,3\})$).
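
The verification left to the reader can be carried out mechanically. The sketch below uses exact rational arithmetic and assumes the standard self-bounding condition (each single-element drop is at most 1 and the drops sum to at most the function value); the helper names are hypothetical.

```python
from fractions import Fraction as F

# The function from the text, in set notation.
f = {
    frozenset(): F(0),
    frozenset({1}): F(1, 5), frozenset({2}): F(2, 5), frozenset({3}): F(3, 5),
    frozenset({1, 2}): F(3, 5), frozenset({1, 3}): F(4, 5), frozenset({2, 3}): F(3, 5),
    frozenset({1, 2, 3}): F(1),
}

def is_monotone(f):
    return all(f[a] <= f[b] for a in f for b in f if a <= b)

def is_self_bounding(f, ground):
    """Check the (assumed) self-bounding condition at every set."""
    for s in f:
        # s ^ {i} toggles element i; the drop is f(s) minus the smaller of the two values.
        drops = [f[s] - min(f[s], f[s ^ {i}]) for i in ground]
        if max(drops) > 1 or sum(drops) > f[s]:
            return False
    return True

def subadditivity_violations(f):
    """All ordered pairs (a, b) with f(a | b) > f(a) + f(b)."""
    return [(set(a), set(b)) for a in f for b in f if f[a | b] > f[a] + f[b]]

print(is_monotone(f), is_self_bounding(f, {1, 2, 3}))
print(subadditivity_violations(f))
```

On this function the check reports the pair $\{1\}$, $\{2,3\}$ as the witness that subadditivity fails, since $1/5 + 3/5 < 1$.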